(57) Abstract

This invention provides a method and apparatus for detecting common spans
within one or more data blocks by partitioning the blocks (figure 4) into
subblocks and searching the group of subblocks (figure 12) (or their
corresponding hashes (figure 13)) for duplicates. Blocks can be partitioned
into subblocks using a variety of methods, including methods that place
subblock boundaries at fixed positions (figure 3), methods that place
subblock boundaries at data-dependent positions (figure 3), and methods that
yield multiple overlapping subblocks (figure 6). By comparing the hashes of
subblocks, common spans of one or more blocks can be identified without ever
having to compare the blocks or subblocks themselves (figure 13). This leads
to several applications including an incremental backup system that backs up
changes rather than changed files (figure 25), a utility that determines the
similarities and differences between two files (figure 13), a file system
that stores each unique subblock at most once (figure 26), and a
communications system that eliminates the need to transmit subblocks already
possessed by the receiver (figure 19).
Method For Partitioning A Block of Data Into Subblocks
And For Storing And Communicating Such Subblocks

INTRODUCTION 

The present invention provides a method and apparatus for identifying identical
subblocks of data within one or more blocks of data, and for communicating and
storing such subblocks in an efficient manner.

BACKGROUND 

Much of the massive amount of information stored, communicated, and manipulated
by modern computer systems is duplicated within the same or a related computer
system. It is commonplace, for example, for computers to store many slightly
differing versions of the same document. It is also commonplace for data transmitted
during a backup operation to be almost identical to the data transmitted during
the previous backup operation. Computer networks also must repeatedly carry the
same or similar data in accordance with the requirements of their users.

Despite the obvious benefits that would flow from a reduction in the redundancy of
communicated and stored data, few computer systems perform any such optimization.
Some instances can be found at the application level, one example being the
class of incremental backup utilities that save only those files that have changed
since the most recent backup. However, even these utilities do not attempt to
exploit the significant similarities between old and new versions of files, and between
files sharing other close semantic ties. This kind of redundancy can be approached
only by analysing the contents of the files.

The present invention addresses the potential for reducing redundancy by providing 
an efficient method for identifying identical portions of data within a group of blocks 
of data, and for using this identification to increase the efficiency of systems that 
store and communicate data.

SUMMARY OF THE INVENTION 

To identify identical portions of data within a group of blocks of data, the blocks 
must be analysed. In a simple aspect of the invention, the blocks are divided into 
fixed-length (e.g. 512-byte) subblocks and these subblocks are compared with each
other so as to identify all identical subblocks. This knowledge can then be used to 
manage the blocks in more efficient ways. 

Unfortunately, the partitioning of blocks into fixed-length subblocks does not always
provide a suitable framework for the recognition of duplicated portions of data, as
identical portions of data can occur in different sizes and places within a group of
blocks of data. Figure 1 shows how division into fixed-size subblocks fails to generate
identical subblocks in two blocks whose only difference is the insertion of a single
byte ('X'). A comparison of the two groups of subblocks would reveal no identical
pairs of subblocks.

In a more sophisticated aspect of the invention, the blocks are partitioned at
boundaries determined by the content of the data itself. For example, the block could be
divided at each point at which the preceding three bytes hash to a particular
constant value. Figure 2 shows how such a partitioning could turn out, and contrasts
it with a fixed-length partitioning.
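
The contrast between the two partitionings can be sketched in a few lines of Python. This is an illustration only: the three-byte "hash" (a sum modulo `modulus`), the names `fixed_partition` and `content_partition`, and the test data are all assumptions chosen for readability and do not come from the specification.

```python
def fixed_partition(block: bytes, size: int) -> list[bytes]:
    """Cut the block every `size` bytes, regardless of content."""
    return [block[i:i + size] for i in range(0, len(block), size)]


def content_partition(block: bytes, modulus: int = 8, target: int = 0) -> list[bytes]:
    """Place a boundary wherever the preceding three bytes hash to `target`.

    The hash used here (sum of the three bytes modulo `modulus`) is a
    hypothetical stand-in; any function of the local content would do.
    """
    subblocks, start = [], 0
    for k in range(3, len(block)):
        if sum(block[k - 3:k]) % modulus == target:
            subblocks.append(block[start:k])
            start = k
    subblocks.append(block[start:])
    return subblocks


original = bytes(range(40, 200))
modified = original[:5] + b"X" + original[5:]  # a single inserted byte

# Subblocks that the two versions have in common under each scheme.
shared_fixed = set(fixed_partition(original, 16)) & set(fixed_partition(modified, 16))
shared_content = set(content_partition(original)) & set(content_partition(modified))
```

With this data the fixed-size scheme finds no subblock in common, whereas the content-dependent scheme realigns after the first boundary that follows the inserted byte and shares every later subblock.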

The fact that a partitioning is data dependent does not imply that it must
incorporate any knowledge of the syntax or semantics of the data. So long as the boundaries
are positioned in a manner dependent on the local data content, identical subblocks
are likely to be formed from identical portions of data, even if the two portions are
not identically aligned relative to the start of their enclosing blocks (Figure 3).

Once the group of blocks has been partitioned into subblocks, the resulting group of 
subblocks can be manipulated in a manner that exploits the occurrence of duplicate 
subblocks. This leads to a variety of applications, some of which are listed below. 
However, the application of a further aspect of the invention leads to even greater 
benefits. 

In a further aspect of the invention, the hash of one or more subblocks is
calculated. The hash function can be an ordinary hash function or one providing
cryptographic strength. The hash function maps each subblock into a small tractable
value (e.g. 128 bits) that provides an identity of the subblock. These hashes can
usually be manipulated more efficiently than their corresponding subblocks.
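
As a minimal sketch of this idea, the Python fragment below maps subblocks of arbitrary length to 128-bit identities. The choice of BLAKE2b truncated to 16 bytes and the name `subblock_hash` are assumptions made for the example; the specification does not name a particular hash function.

```python
import hashlib


def subblock_hash(subblock: bytes) -> bytes:
    """Map a subblock of any length to a fixed 128-bit (16-byte) identity."""
    # Assumption: BLAKE2b with a 16-byte digest stands in for "a hash function".
    return hashlib.blake2b(subblock, digest_size=16).digest()


subblocks = [b"alpha" * 1000, b"beta" * 1000, b"alpha" * 1000]
hashes = [subblock_hash(s) for s in subblocks]

# Comparing two 16-byte hashes replaces comparing two multi-kilobyte subblocks.
first_and_third_match = hashes[0] == hashes[2]
```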

Some applications of aspects of this invention are:

Fine-grained incremental backups: Conventional incremental
backup technology uses the file as the unit of backup. However, in
practice many large files change only slightly, resulting in a wasteful
re-transmission of changed files. By storing the hashes of subblocks of
the previous versions of files, the transmission of unchanged subblocks
can be eliminated.

Communications: By providing a framework for communicating the
hashes of subblocks, the invention can eliminate the transmission of
subblocks already possessed by the receiver.

Differences: The invention could be used as the basis of a program that
determines the areas of similarity and difference between two blocks.

Low-redundancy file system: Data stored in a file system can be
partitioned into subblocks whose hashes can be compared so as to eliminate
the redundant storage of identical subblocks.

Virtual memory: Virtual memory could be organized by subblock, using
a table of hashes to determine if a subblock is somewhere in memory.

Clarification Of Terms 

The terms block and subblock both refer, without limitation, to finite blocks or
infinite blocks (sometimes called streams) of zero or more bits or bytes of digital data.
Although the two different terms ("block" and "subblock") essentially describe the
same substance (digital data), the two different terms have been employed so as
to indicate the role that a particular piece of data is playing. The term "block" is
usually used to refer to raw data to be manipulated by aspects of the invention. The
term "subblock" is usually used to refer to a part of a block.

The term partition has its usual meaning of exhaustively dividing an entity into
mutually exclusive parts. However, within this patent specification, the term also
includes:

• Analyses in which only one or more parts are analysed. 

• Analyses in which multiple overlapping subblocks are formed.

A natural number is a non-negative integer (0, 1, 2, 3, 4, 5, ...).

Where the phrase zero or more is used, this phrase is intended to encompass the
degenerate case where the objects being enumerated are not considered at all, as
well as the case where zero or more objects are used.

BRIEF DESCRIPTION 

The following aspects of this invention are numbered for reference purposes. The
terms "block" and "subblock" refer to blocks and subblocks of digital data.

1. In an aspect of the invention, the invention provides a method for partitioning a
block b into one or more subblocks, the method using the component:

(i) a deterministic or non-deterministic function F that returns one of at least two
values, and whose arguments include at least a block of A bits and a block of B
bits, where A and B are natural numbers;

and comprising the step of:

a. Basing the positions of subblock boundaries on the positions k in the block for
which $F(b_{k-A} \ldots b_k, b_{k+1} \ldots b_{k+B})$ falls within a predetermined subclass of the set of
possible function result values.

Note: The specification of this aspect (and each other specification of an aspect involving
a function F) encompasses the degenerate case in which either A or B is zero and the
function takes (without limitation) just one of the two arguments described. Such
specifications also include the case of functions F that do not use some bits of their arguments.
A function F that bases its calculation solely on (say) $b_{k-3}$ and $b_{k+2}$ would fall under the
classes of F constrained by the condition $A \ge 3$ and $B \ge 2$.
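
A rough sketch of aspect 1 in Python follows. Several things here are assumptions made for illustration and are not taken from the specification: the windows are measured in bytes rather than bits, position k is taken to mean "a boundary just before `block[k]`", and `toy_f` is a hypothetical function F with four possible result values.

```python
from typing import Callable


def boundary_positions(block: bytes, A: int, B: int,
                       F: Callable[[bytes, bytes], int],
                       accepted: set[int]) -> list[int]:
    """Return every position k whose surrounding windows satisfy F.

    Assumption: the left window is the A bytes block[k-A:k] and the right
    window is the B bytes block[k:k+B]; `accepted` is the predetermined
    subclass of result values that marks a boundary.
    """
    positions = []
    for k in range(A, len(block) - B + 1):
        left, right = block[k - A:k], block[k:k + B]
        if F(left, right) in accepted:
            positions.append(k)
    return positions


def toy_f(left: bytes, right: bytes) -> int:
    """Hypothetical F returning one of four values (0..3)."""
    return (sum(left) + sum(right)) % 4


block = bytes([1, 2, 3, 4, 5, 6, 7, 8])
two_sided = boundary_positions(block, A=2, B=1, F=toy_f, accepted={0})
# Degenerate case from the note: B = 0, so F effectively sees one argument.
left_only = boundary_positions(block, A=2, B=0, F=toy_f, accepted={1})
```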

2. In a further aspect of the invention, the invention provides a method for locating
the nearest subblock boundary on a particular side of a particular position p within
a block, the method using the same components as aspect 1, but replacing step (a)
with:

a. Evaluating $F(b_{p-A} \ldots b_p, b_{p+1} \ldots b_{p+B})$ for increasing (or decreasing) p until the
result of F falls within a predetermined subclass of the set of possible function result
values, the position of the resultant boundary being based on this position.
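
The search in aspect 2 might be sketched as below. The boundary test `is_boundary` (the three preceding bytes sum to a multiple of 8) and the function name `nearest_boundary` are hypothetical choices for the example, not part of the specification.

```python
from typing import Optional


def is_boundary(block: bytes, k: int) -> bool:
    """Hypothetical F: a boundary lies just before block[k] when the three
    preceding bytes sum to a multiple of 8."""
    return k >= 3 and sum(block[k - 3:k]) % 8 == 0


def nearest_boundary(block: bytes, p: int, step: int = 1) -> Optional[int]:
    """Scan from p upwards (step=1) or downwards (step=-1) until F accepts.

    Returns None if the scan runs off either end of the block.
    """
    k = p
    while 0 <= k <= len(block):
        if is_boundary(block, k):
            return k
        k += step
    return None


block = bytes(range(40, 200))
after = nearest_boundary(block, 20)        # search towards the end
before = nearest_boundary(block, 20, -1)   # search towards the start
```

Only the bytes near p are examined, so a boundary can be located without partitioning the whole block.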

3. In a further aspect of the invention, the invention provides a method for parti- 
tioning a block into one or more subblocks, the method being identical to one of 
those above, wherein boundaries may be added and removed in accordance with a 
further method. 

4. In a further aspect of the invention, the invention provides a method for parti- 
tioning a block into one or more subblocks, the method being identical to one of 
those above, wherein an upperbound U on the subblock size is imposed. 

5. In a further aspect of the invention, the invention provides a method for
partitioning a block into one or more subblocks, the method being identical to one of
those above, wherein a lowerbound L on the subblock size is imposed.

6. In a further aspect of the invention, the invention provides a method for
partitioning a block into one or more subblocks, the method being identical to one of those
above, wherein an upperbound U on the subblock size is imposed and a lowerbound
L on the subblock size is also imposed.
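
One possible way to impose both bounds is sketched below. The policy shown (ignore a content-defined boundary that would create a subblock shorter than L, and force a boundary once a subblock reaches U) is an assumption for illustration, as is the toy boundary test; the specification leaves the mechanism open.

```python
def bounded_partition(block: bytes, lower: int, upper: int) -> list[bytes]:
    """Content-defined partitioning with a lower bound L and upper bound U."""
    subblocks, start = [], 0
    for k in range(1, len(block)):
        length = k - start
        # Hypothetical boundary test: three preceding bytes sum to a multiple of 8.
        natural = k >= 3 and sum(block[k - 3:k]) % 8 == 0
        if length >= upper or (natural and length >= lower):
            subblocks.append(block[start:k])
            start = k
    subblocks.append(block[start:])  # the final subblock may be shorter than L
    return subblocks


varied = bounded_partition(bytes(range(40, 200)), lower=12, upper=20)
all_zero = bounded_partition(bytes(100), lower=12, upper=20)    # boundary everywhere
no_match = bounded_partition(bytes([1] * 50), lower=12, upper=20)  # boundary nowhere
```

The lower bound stops pathological data (such as a run of zeros) from producing a flood of tiny subblocks, and the upper bound stops data that never satisfies the test from producing one enormous subblock.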

7. In a further aspect of the invention, the invention uses one or more of the methods
above, but applies more than one partitioning function (e.g. F_1, F_2, ...) and method
independently to the block b so as to form more than one group of subblocks.

Note: The subblocks of the various groups are very likely to overlap. The groups produced
by this aspect can be used independently or combined in various ways to form larger groups
of subblocks.

8. In a further aspect of the invention, the invention provides a method for
partitioning a block into one or more subblocks by dividing the block into subblocks of
equal size.

Note: This aspect is not novel and has been included solely so that later aspects can refer
to it.

9. In a further aspect of the invention, the invention provides a method for
partitioning a block into one or more subblocks by dividing the block into subblocks of
a small number of different sizes.

Note: This aspect is not novel and has been included solely so that later aspects can refer
to it.

10. In a further aspect of the invention, the invention uses one of the methods above, 
and additionally forms subblocks from one or more groups of subblocks. 

11. In a further aspect of the invention, the invention employs one of the methods 
above, and additionally forms a hierarchy of subblocks from one or more contiguous 
groups of subblocks. 

Note: The aspects above will be referred to as the partitioning aspects.

12. In a further aspect of the invention, the invention provides a method for
partitioning a block into subblocks and forming a corresponding collection of hashes,
comprising the steps of:

a. Partitioning the block into one or more subblocks in accordance with any
partitioning aspect;

b. Calculating the hash of one or more subblocks using a hash function H.

Note: The collection of hashes is particularly useful if H is a strong one-way hash function.

13. In a further aspect of the invention, the invention provides a method for
constructing a projection of a block, comprising the steps of:

a. Partitioning the block into one or more subblocks in accordance with any parti- 
tioning aspect; 

b. Forming a projection which is an ordered or unordered list containing identities 
(e.g. subblocks or hashes of subblocks) of, or references to, one or more of the 
subblocks. 

Note: The specification of this aspect is intended to admit lists that contain a mixture of
various kinds of identities and references.

Note: In most applications the output of this aspect will be an ordered list of hashes of the
subblocks of the block.
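
The common case described in the note, an ordered list of subblock hashes, could look like the sketch below. The line-based `partition` function and the 128-bit BLAKE2b digest are stand-ins chosen for brevity and are assumptions, not choices made by the specification.

```python
import hashlib


def partition(block: bytes) -> list[bytes]:
    """Stand-in partitioning aspect: cut after every line ending."""
    return block.splitlines(keepends=True)


def projection(block: bytes) -> list[str]:
    """Ordered list of subblock identities (here, 128-bit hashes in hex)."""
    return [hashlib.blake2b(s, digest_size=16).hexdigest() for s in partition(block)]


proj = projection(b"a\nb\na\n")
```

The projection is much smaller than the block whenever subblocks are long, yet it still reveals which subblocks repeat.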

14. In a further aspect of the invention, the invention provides a method for finding 
identical portions within a group of one or more blocks comprising the steps of: 

a. Partitioning one or more of said blocks into one or more subblocks in accordance
with an aspect above;

b. Comparing the subblocks or the identities (e.g. hashes) of the subblocks.

15. In a further aspect of the invention, the invention provides a method for
representing one or more blocks, involving the following components:

(i) A method for storing and retrieving subblocks;

(ii) A mapping from block representatives (e.g. filenames) to lists of entries that
identify subblocks;

whereby the modification of data in a stored block involves the following steps:

a. Partitioning the new data into subblocks in accordance with any partitioning
aspect;

b. Adding subblocks in the new data that are not already in the collection of
stored subblocks to the collection of stored subblocks, and updating the subblock
list associated with the block being modified.
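
The two components and two steps of aspect 15 can be sketched as a small in-memory store. The class name `SubblockStore`, the use of SHA-256 keys, and the fixed four-byte partitioning are assumptions made to keep the example short; any partitioning aspect could be substituted.

```python
import hashlib


class SubblockStore:
    """Toy store in which each unique subblock is kept at most once."""

    def __init__(self) -> None:
        self.subblocks: dict[str, bytes] = {}   # component (i): subblock storage
        self.blocks: dict[str, list[str]] = {}  # component (ii): name -> entry list

    def write(self, name: str, data: bytes, size: int = 4) -> None:
        keys = []
        for i in range(0, len(data), size):      # step (a): partition the new data
            sub = data[i:i + size]
            key = hashlib.sha256(sub).hexdigest()
            self.subblocks.setdefault(key, sub)  # step (b): add only if absent
            keys.append(key)
        self.blocks[name] = keys                 # step (b): update the subblock list

    def read(self, name: str) -> bytes:
        return b"".join(self.subblocks[k] for k in self.blocks[name])


store = SubblockStore()
store.write("a.txt", b"AAAABBBBCCCC")
store.write("b.txt", b"AAAABBBBDDDD")  # shares two subblocks with a.txt
```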

16. In a further aspect of the invention, the invention provides a method for an
entity E1 to communicate a group X of one or more subblocks X_1 ... X_n to E2
where E1 possesses the knowledge that E2 possesses a group Y of zero or more
subblocks Y_1 ... Y_m, comprising the following step:

a. Transmitting from E1 to E2 the contents of a subset of zero or more subblocks in
X, and the remaining subblocks as references which may take (but are not limited
to) the following forms:

(i) a hash of a subblock;

(ii) a reference to a subblock in Y;

(iii) a reference to a range of subblocks in Y;

(iv) a reference to a subblock already transmitted;

(v) a reference to a range of subblocks already transmitted.

Note: In most implementations of this aspect, the subblocks whose contents are transmitted
will be those in X that are not in Y, and for which no identical subblock has previously
been transmitted.

Note: To possess knowledge that E2 possesses Y_1 ... Y_m, E1 need not actually possess
Y_1 ... Y_m itself. E1 need only possess the identities of Y_1 ... Y_m (e.g. the hashes of each
subblock Y_1 ... Y_m). This specification is intended to admit any other representation in
which E1 may have the knowledge that E2 possesses (or has access to) Y_1 ... Y_m. In
particular, the knowledge may take the form of a projection of Y.

Note: It is implicit in this aspect that E1 will be able to use comparison (or other methods)
to use its knowledge of E2's possession of Y to determine the set of subblocks that are
common to both X and Y. For example, if E1 possessed the hashes of the subblocks of
Y, it could compare them to the hashes of the subblocks of X to determine the subblocks
common to both X and Y. Subblocks that are not common can be transmitted explicitly.
Subblocks that are common to both X and Y can be transmitted by transmitting a reference
to the subblock.
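
A minimal sketch of the sending side of aspect 16 follows, covering reference forms (i) and (iv) only. It assumes that E1's knowledge of Y is a set of SHA-256 hashes; the message layout of `("data", ...)` and `("ref", ...)` tuples and the function names are hypothetical.

```python
import hashlib


def digest(subblock: bytes) -> bytes:
    return hashlib.sha256(subblock).digest()


def encode_for_receiver(x_subblocks: list[bytes],
                        known_hashes: set[bytes]) -> list[tuple[str, bytes]]:
    """Send contents only for subblocks the receiver cannot already have."""
    message, already_sent = [], set()
    for sub in x_subblocks:
        d = digest(sub)
        if d in known_hashes or d in already_sent:
            message.append(("ref", d))     # a hash stands in for the subblock
        else:
            message.append(("data", sub))  # contents transmitted explicitly
            already_sent.add(d)
    return message


# E1 knows only the hashes of Y, not Y itself.
known = {digest(b"one")}
message = encode_for_receiver([b"one", b"two", b"three", b"two"], known)
kinds = [kind for kind, _ in message]
```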

17. In a further aspect of the invention, the invention provides a method for an
entity E1 to communicate a block X to E2 where E1 possesses the knowledge that
E2 possesses a group Y of subblocks Y_1 ... Y_m, comprising step (a) of aspect 16
preceded by the step:

s. Partitioning X into subblocks X_1 ... X_n in accordance with any partitioning
aspect.

18. In a further aspect of the invention, the invention provides a method for an entity
E1 to communicate one or more subblocks of a group X of subblocks X_1 ... X_n to
E2 where E1 possesses the knowledge that E2 possesses the block Y, comprising
step (a) of aspect 16 preceded by the step:

s. Partitioning Y into subblocks Y_1 ... Y_m in accordance with any partitioning
aspect.

19. In a further aspect of the invention, the invention provides a method for an
entity E1 to communicate a block X to E2 where E1 possesses the knowledge that
E2 possesses block Y, comprising step (a) of aspect 16 preceded by the steps:

s1. Partitioning X into subblocks X_1 ... X_n in accordance with any partitioning
aspect.

s2. Partitioning Y into subblocks Y_1 ... Y_m in accordance with any partitioning
aspect.

Note: Steps (s1) and (s2) could be performed in any order, or concurrently, as could many
other subsets of steps in this patent specification.

20. In a further aspect of the invention, the invention provides a method for
constructing a block D from a group X of one or more subblocks X_1 ... X_n and a group
Y of zero or more subblocks Y_1 ... Y_m such that X can be constructed from Y and
D, comprising the step:

a. Constructing D from at least one of the following components:

(1) the contents of one or more subblocks in X;

(2) references to subblocks in Y or to subblocks included in D, or to a range of
subblocks from either D or Y.

Note: Component 2 above is intended to encompass the case where a mixture of the
elements it describes is used.
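
One way to build such a block D is sketched below. The entry format (`"lit"` for contents, `"y"` for an index into Y, `"d"` for the index of an earlier entry of D) and the name `build_delta` are assumptions for the example; range references are omitted for brevity.

```python
def build_delta(x: list[bytes], y: list[bytes]) -> list[tuple]:
    """Construct D so that X can later be rebuilt from Y and D."""
    y_index: dict[bytes, int] = {}
    for i, sub in enumerate(y):
        y_index.setdefault(sub, i)        # remember the first occurrence in Y
    d_index: dict[bytes, int] = {}
    delta: list[tuple] = []
    for sub in x:
        if sub in y_index:
            delta.append(("y", y_index[sub]))   # component (2): reference into Y
        elif sub in d_index:
            delta.append(("d", d_index[sub]))   # component (2): reference into D
        else:
            d_index[sub] = len(delta)
            delta.append(("lit", sub))          # component (1): subblock contents
    return delta


y = [b"aa", b"bb"]
x = [b"bb", b"cc", b"aa", b"cc"]
delta = build_delta(x, y)
```

A real implementation would normally look subblocks up by hash rather than by content, which is what lets D be built from a projection of Y alone.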

21. In a further aspect of the invention, the invention provides a method for
constructing a block D from a block X and a group Y of subblocks Y_1 ... Y_m such that
X can be constructed from Y and D, comprising step (a) of aspect 20 preceded by
the step:

s. Partitioning X into subblocks X_1 ... X_n in accordance with any partitioning
aspect.

22. In a further aspect of the invention, the invention provides a method for
constructing a block D from a group X of subblocks X_1 ... X_n and a block Y such that
X can be constructed from Y and D, comprising step (a) of aspect 20 preceded by
the step:

s. Partitioning Y into subblocks Y_1 ... Y_m in accordance with any partitioning
aspect.

23. In a further aspect of the invention, the invention provides a method for
constructing a block D from a block X and a block Y such that X can be constructed
from Y and D, comprising step (a) of aspect 20 preceded by the steps:

s1. Partitioning X into subblocks X_1 ... X_n in accordance with any partitioning
aspect.

s2. Partitioning Y into subblocks Y_1 ... Y_m in accordance with any partitioning
aspect.

Note: Steps (s1) and (s2) could be performed in any order, or concurrently, as could many
other subsets of steps in this patent specification.

24. In a further aspect of the invention, the invention provides a method for
constructing a block D from a group X of subblocks X_1 ... X_n and a projection of a
block Y (or a projection of a group Y of subblocks Y_1 ... Y_m), such that X can be
constructed from Y and D, comprising the step:

a. Constructing D from at least one of the following components:

(1) the contents of one or more subblocks in X;

(2) references to subblocks in Y or to subblocks included in D, or to a range of
subblocks from either D or Y.

Note: The projection of Y will usually have been calculated in accordance with aspect 13.
The projection of a group of subblocks will usually have been calculated in accordance with
step (b) of aspect 13.

Note: An implementation will usually be able to use the projection of Y to determine if a
subblock in X is also in Y.

Note: Component 2 above is intended to encompass the case where a mixture of the
elements it describes is used.

25. In a further aspect of the invention, the invention provides a method for
constructing a block D from a block X and a projection of Y such that X can be
constructed from Y and D, comprising the step of aspect 24 with the following step
inserted before step (a):

s. Partitioning X into subblocks X_1 ... X_n in accordance with any partitioning
aspect.

26. In a further aspect of the invention, the invention provides a method for
constructing a block X (or group X of subblocks X_1 ... X_n) from a group Y of subblocks
Y_1 ... Y_m and a block D, where D was constructed in accordance with one of aspects
20 to 25 above, comprising the step of:

a. Constructing X from D and Y by constructing the subblocks of X based on one
or more of:

(i) references in D to subblocks in Y;

(ii) references in D to subblocks in D;

(iii) references in D that specify a range of subblocks in Y;

(iv) references in D that specify a range of subblocks in D;

(v) subblocks contained within D;

(vi) other data elements in D.
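
The reconstruction step might be sketched as follows. The entry kinds (`"lit"`, `"y"`, `"y_range"`, `"d"`) are a hypothetical encoding of items (i), (ii), (iii) and (v) above and are not defined by the specification; items (iv) and (vi) are left out to keep the example short.

```python
def apply_delta(delta: list[tuple], y: list[bytes]) -> list[bytes]:
    """Rebuild the subblocks of X from D (here `delta`) and Y."""
    resolved: list[list[bytes]] = []  # resolved[i] = subblocks produced by entry i of D
    for kind, value in delta:
        if kind == "lit":             # (v) a subblock contained within D
            produced = [value]
        elif kind == "y":             # (i) reference to one subblock in Y
            produced = [y[value]]
        elif kind == "y_range":       # (iii) reference to a range of subblocks in Y
            start, stop = value
            produced = y[start:stop]
        elif kind == "d":             # (ii) reference to an earlier entry of D
            produced = resolved[value]
        else:
            raise ValueError(f"unknown entry kind: {kind!r}")
        resolved.append(produced)
    return [sub for produced in resolved for sub in produced]


y = [b"aa", b"bb", b"cc"]
delta = [("y_range", (0, 2)), ("lit", b"zz"), ("d", 1), ("y", 2)]
x = apply_delta(delta, y)
```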

27. In a further aspect of the invention, the invention provides a method for
constructing a block X (or group X of subblocks X_1 ... X_n) from a block Y and a block
D, where D was constructed in accordance with one of aspects 20 to 25 above,
comprising the step of aspect 26 with the following step inserted before step (a):

s. Partitioning Y into subblocks Y_1 ... Y_m in accordance with any partitioning
aspect.

28. In a further aspect of the invention, the invention provides a method for
transmitting a group X of subblocks X_1 ... X_n from one entity E1 to another entity E2,
comprising the steps of:

a. Transmitting from E1 to E2 an identity of one or more subblocks;

b. Transmitting from E2 to E1 information communicating the presence or absence
of subblocks at E2;

c. E1 transmitting to E2 at least the subblocks identified in step (b) as not being
present at E2.

Note: The information communicated in step (b) could take the form of a bitmap (or a
compressed bitmap) corresponding to the subblocks referred to in step (a). It could also
take many other forms.

Note: If a group of subblocks are to be transmitted, the above steps could be performed
completely for each subblock before moving onto the next subblock. The steps could be
applied to any subgroup of subblocks.
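
The three-step exchange of aspect 28 can be simulated in one process as sketched below. Modelling E2 as a dictionary keyed by SHA-256 hash, and the function names `step_a`, `step_b`, `step_c` and `transfer`, are assumptions for illustration; the bitmap reply follows the first note above.

```python
import hashlib


def digest(subblock: bytes) -> bytes:
    return hashlib.sha256(subblock).digest()


def step_a(x: list[bytes]) -> list[bytes]:
    """E1 -> E2: identities (hashes) of the subblocks to be transferred."""
    return [digest(s) for s in x]


def step_b(identities: list[bytes], e2_store: dict[bytes, bytes]) -> list[bool]:
    """E2 -> E1: bitmap saying which identified subblocks E2 already holds."""
    return [i in e2_store for i in identities]


def step_c(x: list[bytes], bitmap: list[bool]) -> list[bytes]:
    """E1 -> E2: contents of the subblocks reported absent."""
    return [s for s, present in zip(x, bitmap) if not present]


def transfer(x: list[bytes], e2_store: dict[bytes, bytes]):
    identities = step_a(x)
    bitmap = step_b(identities, e2_store)
    for sub in step_c(x, bitmap):
        e2_store[digest(sub)] = sub
    return [e2_store[i] for i in identities], bitmap


e2_store = {digest(b"old"): b"old"}
received, bitmap = transfer([b"old", b"new", b"old"], e2_store)
```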
29. In a further aspect of the invention, the invention provides a method for
communicating a data block X from one entity E1 to another entity E2, comprising the
steps of aspect 28 but with the following step inserted before step (a):

s. Partitioning X into subblocks X_1 ... X_n in accordance with any partitioning
aspect.

30. In a further aspect of the invention, the invention provides a method for
comparing the contents of two or more blocks comprising the steps:

a. Constructing a projection of each block as described in aspect 13;

b. Comparing the projections of the blocks.

Note: The phrase "comparing the projections" is intended to include not just the case
where the two projections are tested to see if they are the same, but also the case where
the subblocks (or projections of subblocks) within the projections are compared so as to
determine the subblocks that are common to the two original blocks.
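
Both senses of "comparing the projections" can be sketched briefly. The fixed four-byte partitioning, the SHA-256 identities, and the function names are assumptions chosen to keep the example small.

```python
import hashlib


def projection(block: bytes, size: int = 4) -> list[str]:
    """Ordered list of hashes of fixed-size subblocks (a stand-in partitioning)."""
    return [hashlib.sha256(block[i:i + size]).hexdigest()
            for i in range(0, len(block), size)]


def compare(p1: list[str], p2: list[str]) -> tuple[bool, list[int]]:
    """Return (are the blocks identical?, indices in p1 of subblocks also in p2)."""
    common = set(p1) & set(p2)
    return p1 == p2, [i for i, h in enumerate(p1) if h in common]


pa = projection(b"AAAABBBBCCCC")
pb = projection(b"AAAAXXXXCCCC")
different = compare(pa, pb)
same = compare(pa, pa)
```

Neither comparison touches the original block contents once the projections exist.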

31. In a further aspect of the invention, the invention provides a method for
transmitting a group X of subblocks X_1 ... X_n from one entity E1 to another entity E2,
comprising the steps of:

a. Transmitting from E2 to E1 information communicating the presence or absence
at E2 of members of a group Y of subblocks Y_1 ... Y_m;

b. Transmitting from E1 to E2 the contents of zero or more subblocks in X, and
the remaining subblocks as references which may take (but are not limited to) the
following forms:

(i) a hash of a subblock;

(ii) a reference to a subblock in Y;

(iii) a reference to a range of subblocks in Y;

(iv) a reference to a subblock already transmitted;

(v) a reference to a range of subblocks already transmitted.

Note: The information communicated in step (a) could take the form of subblock identities
such as hashes. It could also take many other forms.

32. In a further aspect of the invention, the invention provides a method for
transmitting a block X from one entity E1 to another entity E2, comprising the steps of
aspect 31 with the following step inserted before step (a):

s. E1 partitioning X into subblocks X_1 ... X_n in accordance with any partitioning
aspect.

33. In a further aspect of the invention, the invention provides a method for an
entity E2 to communicate to an entity E1 the fact that E2 possesses a group Y of
subblocks Y_1 ... Y_m, comprising the step of:

a. E2 transmitting to E1 identities or references of the subblocks Y_1 ... Y_m.

34. In a further aspect of the invention, the invention provides a method for an
entity E2 to communicate to an entity E1 the fact that E2 possesses a block Y,
comprising the step of aspect 33 with the following step inserted before step (a):

s. E2 partitioning Y into subblocks Y_1 ... Y_m in accordance with any partitioning
aspect.

35. In a further aspect of the invention, the invention provides a method for an
entity E1 to communicate a subblock X_i to an entity E2, comprising the steps:

a. E2 sending E1 an identity of X_i.

b. E1 sending X_i to E2.

Note: This aspect applies (among other applications) to the case of a network server E1
that serves subblocks to clients such as E2, given the identities (e.g. hashes) of the requested
subblocks.

36. In a further aspect of the invention, the invention provides a method for an
entity E1 to communicate a subblock X_i to an entity E2, comprising the steps of
aspect 35 with the following step inserted before step (a):

s. E1 partitioning a block X into subblocks X_1 ... X_n in accordance with any
partitioning aspect.

37. In a further aspect of the invention, the invention provides a method similar
to any of those above, but where one or more of the comparisons of subblocks are
performed by comparing the hashes of the subblocks, using hashes already available
(e.g. as a byproduct of other steps), or calculated for the purpose of performing one
or more said comparisons.

38. In a further aspect of the invention, the invention provides a method similar
to any of those above, but where subsets of identical subblocks within a group of
one or more subblocks are identified by inserting each subblock, an identity of each
subblock, a reference of each subblock, or a hash of each subblock, into a data
structure.
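
A hash table is one obvious data structure for aspect 38. The sketch below, which assumes SHA-256 as the hash and uses the hypothetical name `duplicate_groups`, inserts the hash of each subblock into a dictionary and reads off the groups of identical subblocks.

```python
import hashlib
from collections import defaultdict


def duplicate_groups(subblocks: list[bytes]) -> list[list[int]]:
    """Group the indices of identical subblocks by inserting hashes into a table."""
    table: dict[bytes, list[int]] = defaultdict(list)
    for index, sub in enumerate(subblocks):
        table[hashlib.sha256(sub).digest()].append(index)
    # Keep only the hashes that were inserted more than once.
    return [indices for indices in table.values() if len(indices) > 1]


groups = duplicate_groups([b"a", b"b", b"a", b"c", b"b", b"a"])
```

Each subblock is hashed and inserted once, so the duplicates are found in a single pass rather than by comparing every pair of subblocks.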

39. In a further aspect of the invention, the invention provides a method identical
to any of those above, but with various subsets of steps executed concurrently.

WO 9605801    PCT/AU96/00081

40. In an aspect of the invention, the invention provides an apparatus for
partitioning a block b into one or more subblocks, the apparatus comprising:

(i) means for evaluating a deterministic or non-deterministic function F that returns
one of at least two values, and whose arguments include at least a block of A bits
and a block of B bits, where A and B are natural numbers;

and comprising the step of

a. Generating a set of partitions of b, basing these upon the positions of subblock
boundaries on the positions k in the block for which $F(b_{k-A} \ldots b_k, b_{k+1} \ldots b_{k+B})$ falls
within a predetermined subclass of the set of possible function result values.

Note: The specification of this aspect (and each other specification of an aspect involving
a function F) encompasses the degenerate case in which either A or B is zero and the
function takes (without limitation) just one of the two arguments described. Such
specifications also include the case of functions F that do not use some bits of their arguments.
A function F that bases its calculation solely on (say) $b_{k-3}$ and $b_{k+2}$ would fall under the
classes of F constrained by the condition $A \ge 3$ and $B \ge 2$.

41. In a further aspect of the invention, the invention provides an apparatus for
locating the nearest subblock boundary on a particular side of a particular position
p within a block b, the apparatus comprising the same elements as aspect 40, but
replacing step (a) with

a. Generating a position within b by evaluating $F(b_{p-A} \ldots b_p, b_{p+1} \ldots b_{p+B})$ for
increasing (or decreasing) p until the result of F falls within a predetermined subclass
of the set of possible function result values, the position being based on this position.

42. In a further aspect of the invention, the invention provides an apparatus for
partitioning a block into one or more subblocks comprising

(i) means for dividing a block into subblocks of equal size.

Note: This aspect is not novel and has been included solely so that later aspects can refer
to it.

43. In a further aspect of the invention, the invention provides an apparatus for
partitioning a block into one or more subblocks comprising

(i) means for dividing a block into subblocks of a small number of different sizes.

Note: This aspect is not novel and has been included solely so that later aspects can refer
to it.

44. In a further aspect of the invention, the invention provides an apparatus E1 
that can communicate a group X of one or more subblocks X1 . . . Xn to an entity 
E2 where E1 possesses the knowledge that E2 possesses a group Y of zero or more 
subblocks Y1 . . . Ym, the apparatus comprising 

(i) means for manipulating subblocks, subblock identities, and subblock references; 
and comprising the step 

a. Transmitting from E1 to E2 the contents of a subset of zero or more subblocks in 
X, and the remaining subblocks as references which may take (but are not limited 
to) the following forms: 

(i) a hash of a subblock; 

(ii) a reference to a subblock in Y; 

(iii) a reference to a range of subblocks in Y; 

(iv) a reference to a subblock already transmitted; 

(v) a reference to a range of subblocks already transmitted. 

Note: In most implementations of this aspect, the subblocks whose contents are transmitted 
will be those in X that are not in Y, and for which no identical subblock has previously 
been transmitted. 

Note: To possess knowledge that E2 possesses Y1 . . . Ym, E1 need not actually possess 
Y1 . . . Ym itself. E1 need only possess the identities of Y1 . . . Ym (e.g. the hashes of each 
subblock Y1 . . . Ym). This specification is intended to admit any other representation in 
which E1 may have the knowledge that E2 possesses (or has access to) Y1 . . . Ym. In 
particular, the knowledge may take the form of a projection of Y. 

Note: It is implicit in this aspect that E1 will be able to use comparison (or other methods) 
to use its knowledge of E2's possession of Y to determine the set of subblocks that are 
common to both X and Y. For example, if E1 possessed the hashes of the subblocks of 
Y, it could compare them to the hashes of the subblocks of X to determine the subblocks 
common to both X and Y. Subblocks that are not common can be transmitted explicitly. 
Subblocks that are common to both X and Y can be transmitted by transmitting a reference 
to the subblock. 
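
The comparison described in the notes to aspect 44 can be sketched as follows. This is a minimal illustration, not the patented apparatus: SHA-256 is an assumed choice of hash, and the names `identity` and `encode_for_transmission` and the ("data", ...) / ("ref", ...) message items are hypothetical.

```python
import hashlib

def identity(subblock: bytes) -> bytes:
    """Identity of a subblock (assumption: SHA-256 as the hash)."""
    return hashlib.sha256(subblock).digest()

def encode_for_transmission(x_subblocks, y_identities):
    """Encode X as explicit contents or references, given the identities of Y."""
    known = set(y_identities)            # what E1 knows E2 already possesses
    message = []
    for sb in x_subblocks:
        h = identity(sb)
        if h in known:
            message.append(("ref", h))   # E2 can resolve this reference itself
        else:
            message.append(("data", sb)) # contents are transmitted explicitly
            known.add(h)                 # later copies become references
    return message

x = [b"alpha", b"beta", b"alpha", b"gamma", b"beta"]
y_ids = [identity(b"gamma"), identity(b"delta")]
msg = encode_for_transmission(x, y_ids)
kinds = [kind for kind, _ in msg]
```

Here only the first occurrences of "alpha" and "beta" are sent as contents; "gamma" and the repeated subblocks are sent as hash references.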

45. In a further aspect of the invention, the invention provides an apparatus for 
constructing a block D from a group X of one or more subblocks X1 . . . Xn and a 
group Y of zero or more subblocks Y1 . . . Ym such that X can be constructed from 
Y and D, the apparatus comprising: 

(i) means for manipulating subblocks, subblock identities and subblock references; 
and comprising the step 

a. Constructing D from at least one of the following components: 

(1) the contents of one or more subblocks in X; 

(2) references to subblocks in Y or to subblocks included in D, or to a range of 
subblocks from either D or Y. 

Note: Component 2 above is intended to encompass the case when a mixture of the elements 
it describes is used. 

46. In a further aspect of the invention, the invention provides an apparatus for 
constructing a block D from a group X of subblocks X1 . . . Xn and a projection of 
a block Y (or a projection of a group Y of subblocks Y1 . . . Ym), such that X can be 
constructed from Y and D, the apparatus comprising 

(i) means for manipulating subblocks, subblock identities and subblock references; 
and comprising the step: 

a. Constructing D from at least one of the following components: 

(1) the contents of one or more subblocks in X; 

(2) references to subblocks in Y or to subblocks included in D, or to a range of 
subblocks from either D or Y. 

Note: The projection of Y will usually have been calculated in accordance with aspect 13. 
The projection of a group of subblocks will usually have been calculated in accordance with 
step (b) of aspect 13. 

Note: An implementation will usually be able to use the projection of Y to determine if a 
subblock in X is also in Y. 

Note: Component 2 above is intended to encompass the case where a mixture of the elements 
it describes is used. 

47. In a further aspect of the invention, the invention provides an apparatus for 
constructing a block X (or group X of subblocks X1 . . . Xn) from a group Y of 
subblocks Y1 . . . Ym and a block D, where D was constructed in accordance with one 
of aspects 20 to 25 above, the apparatus comprising 

(i) means for manipulating subblocks, subblock identities and subblock references; 
and comprising the step of 

a. Constructing X from D and Y by constructing the subblocks of X based on one 
or more of: 

(i) references in D to subblocks in Y; 

(ii) references in D to subblocks in D; 

(iii) references in D that specify a range of subblocks in Y; 

(iv) references in D that specify a range of subblocks in D; 

(v) subblocks contained within D; 

(vi) other data elements in D. 
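
A reconstruction step of this kind might look like the sketch below. The encoding of D as a list of tagged items ("data", "y", "y_range", "d") is purely hypothetical and chosen for illustration; the specification does not prescribe any particular layout for D.

```python
def reconstruct(d_items, y_subblocks):
    """Rebuild the subblocks of X from D and Y.

    Hypothetical item kinds in D:
      ("data", bytes)            a subblock contained within D
      ("y", i)                   a reference to subblock Y[i]
      ("y_range", first, last)   a reference to Y[first..last], inclusive
      ("d", j)                   a reference to the j-th subblock contained in D
    """
    d_data = []   # subblocks carried inside D, in order of appearance
    x = []
    for item in d_items:
        kind = item[0]
        if kind == "data":
            d_data.append(item[1])
            x.append(item[1])
        elif kind == "y":
            x.append(y_subblocks[item[1]])
        elif kind == "y_range":
            x.extend(y_subblocks[item[1]:item[2] + 1])
        elif kind == "d":
            x.append(d_data[item[1]])
        else:
            raise ValueError(f"unknown item kind: {kind!r}")
    return x

y = [b"one ", b"two ", b"three ", b"four "]
d = [("data", b"zero "), ("y_range", 0, 2), ("d", 0), ("y", 3)]
x_block = b"".join(reconstruct(d, y))
```

In this example D carries only one subblock explicitly; everything else in X is recovered through references into Y or back into D.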

48. In a further aspect of the invention, the invention provides a system for transmitting 
a group X of subblocks X1 . . . Xn from one apparatus E1 to another apparatus 
E2, each apparatus consisting of 

(i) means for manipulating subblocks, subblock identities and subblock references; 
and the system's execution comprising the steps of: 

a. Transmitting from E1 to E2 an identity of one or more subblocks; 

b. Transmitting from E2 to E1 information communicating the presence or absence 
of subblocks at E2; 

c. E1 transmitting to E2 at least the subblocks identified in step (b) as not being 
present at E2. 

Note: The information communicated in step (b) could take the form of a bitmap (or a 
compressed bitmap) corresponding to the subblocks referred to in step (a). It could also 
take many other forms. 

Note: If a group of subblocks are to be transmitted, the above steps could be performed 
completely for each subblock before moving onto the next subblock. The steps could be 
applied to any subgroup of subblocks. 
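
The three steps of aspect 48 can be sketched as an in-process exchange. This is an illustration under stated assumptions: SHA-256 stands in for the identity function, the presence information of step (b) is a plain list of booleans acting as the bitmap, and the `Receiver` class and `transmit` function are hypothetical names.

```python
import hashlib

def identity(subblock: bytes) -> bytes:
    """Identity of a subblock (assumption: SHA-256 as the hash)."""
    return hashlib.sha256(subblock).digest()

class Receiver:
    """Plays the role of E2; stores subblocks keyed by identity."""
    def __init__(self, subblocks):
        self.store = {identity(sb): sb for sb in subblocks}

    def presence_bitmap(self, identities):
        # Step (b): one flag per identity offered in step (a).
        return [ident in self.store for ident in identities]

    def accept(self, subblocks):
        for sb in subblocks:
            self.store[identity(sb)] = sb

def transmit(x_subblocks, receiver):
    """Plays the role of E1; returns the subblocks whose contents were sent."""
    identities = [identity(sb) for sb in x_subblocks]    # step (a)
    bitmap = receiver.presence_bitmap(identities)        # step (b)
    missing, seen = [], set()
    for sb, ident, present in zip(x_subblocks, identities, bitmap):
        if not present and ident not in seen:            # send each one once
            missing.append(sb)
            seen.add(ident)
    receiver.accept(missing)                             # step (c)
    return missing

e2 = Receiver([b"b", b"d"])
sent = transmit([b"a", b"b", b"c", b"a"], e2)
```

Only "a" and "c" cross the channel as contents; "b" is already held by the receiver.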

49. In a further aspect of the invention, the invention provides a system for transmitting 
a group X of subblocks X1 . . . Xn from one apparatus E1 to another apparatus 
E2, each apparatus consisting of 

(i) means for manipulating subblocks, subblock identities and subblock references; 
and comprising the steps of: 

a. Transmitting from E2 to E1 information communicating the presence or absence 
at E2 of members of a group Y of subblocks Y1 . . . Ym; 

b. Transmitting from E1 to E2 the contents of zero or more subblocks in X, and 
the remaining subblocks as references which may take (but are not limited to) the 
following forms: 

(i) a hash of a subblock; 

(ii) a reference to a subblock in Y; 

(iii) a reference to a range of subblocks in Y; 

(iv) a reference to a subblock already transmitted; 

(v) a reference to a range of subblocks already transmitted. 

Note: The information communicated in step (a) could take the form of subblock identities 
such as hashes. It could also take many other forms. 

50. In a further aspect of the invention, the invention provides a system for an 
apparatus E2 to communicate to an apparatus E1 the fact that E2 possesses a 
group Y of subblocks Y1 . . . Ym, each apparatus consisting of 

(i) means for manipulating subblocks, subblock identities and subblock references; 
and comprising the step of: 

a. E2 transmitting to E1 identities or references of the subblocks in Y. 

51. In a further aspect of the invention, the invention provides a system for an 
apparatus E1 to communicate a subblock Xi to an apparatus E2, each apparatus 
consisting of 

(i) means for manipulating subblocks, subblock identities and subblock references; 
and comprising the steps: 

a. E2 sending E1 an identity of Xi. 

b. E1 sending Xi to E2. 

Note: This aspect applies (among other applications) to the case of a network server E1 
that serves subblocks to clients such as E2, given the identities (e.g. hashes) of the requested 
subblocks. 



BRIEF DESCRIPTION OF FIGURES 

Figure 1 shows how data can become "misaligned" relative to its containing blocks 
when data is inserted. 

Figure 2 shows how data can be divided into fixed-width subblocks or variable-width 
subblocks. 

Figure 3 shows how data-dependent partitions move with the data when the data 
is shifted (e.g. by an insertion) (Compare with Figure 1). 

Figure 4 depicts the data-dependent partitioning of a block of data b into subblocks 
using a function F. 

Figure 5 depicts the search within a block b for a subblock boundary (as defined by 
F) using F. 

Figure 6 shows how a block may be subdivided in different ways using different 
boundary functions. 

Figure 7 shows how "higher order" subblocks can be constructed from one or more 
initial subblocks. 

Figure 8 shows how different partitioning functions can produce subblocks of differing 
average sizes. 

Figure 9 shows how subblocks can be constructed (or organized) into a hierarchy. 
Such a hierarchy can be constructed by restricting, in stages, F's "partition" result 
subset. 

Figure 10 depicts a method (and apparatus) for the partitioning of a block b into 
subblocks using F and the calculation of the hashes of the subblocks using hash 
function H. 

Figure 11 depicts the partitioning of a block b into subblocks using F and the 
projection of those subblocks into a structure consisting of subblock hashes, subblock 
data, and subblock references. 

Figure 12 depicts a method (and apparatus) for partitioning two blocks b1 and b2 
into subblocks using F and the comparison of those subblocks. 

Figure 13 depicts a method (and apparatus) for the partitioning using F of two 
blocks b1 and b2 into subblocks, the calculation using H of the hashes of the subblocks, 
and the comparison of those hashes with each other to determine (among 
other things) subblocks common to both b1 and b2. 

Figure 14 depicts a method (and apparatus) for a file system that employs an aspect 
of the invention to eliminate the multiple storage of data common to more than one file 
(or to different parts of the same file). 

Figure 15 depicts a method (and apparatus) for the communication of a block X 
from E1 to E2 where both E1 and E2 possess Y. 

Figure 16 depicts a method (and apparatus) for the construction of a block D from 
which X may be later reconstructed, given Y. 

Figure 17 depicts a method (and apparatus) for the construction of a block D from 
which X may be later reconstructed, given Y. In this case, the entity constructing 
D does not have access to Y, only to a projection of Y (in this case being the hashes 
of the subblocks of Y). 

Figure 18 depicts a method (and apparatus) for the reconstruction of X from the 
blocks Y and D. 

Figure 19 depicts a method (and apparatus (E1 and E2 at each time)) for the 
communication of a block X from entity E1 to entity E2 where E2 already possesses 
Y. 

Figure 20 depicts a method (and apparatus (E1 and E2 at each time)) for the 
communication of a block X from entity E1 to entity E2 where E2 already possesses 
Y and where E2 first discloses to E1 information about Y. 

Figure 21 depicts a method (and apparatus) for the communication from entity E2 
to entity E1 of information about a block (or group of subblocks) Y at E2. 

Figure 22 depicts a method (and apparatus (E1 and E2 at each time)) for the 
communication from entity E1 to entity E2 of subblock Xi following a request by 
entity E2 for the subblock Xi. 

Figure 23 depicts an apparatus for partitioning a block b (the input) using a parti- 
tioning function F. The output is a set of subblock boundary positions. 

Figure 24 depicts a method (and apparatus) for the partitioning of a block b into 
subblocks using F and the projection of those subblocks into a table of subblock 
hashes. 

Figure 25 depicts a method (and apparatus) for the transmission from entity E1 to 
E2 of a block X where E2 possesses Y and E1 possesses a table of the hashes of 
the subblocks of Y (a projection of Y). 

Figure 26 depicts a method (and apparatus) for a file system that employs an aspect 
of the invention to eliminate the multiple storage of data common to more than one file 
(or to different parts of the same file). 



DETAILED DESCRIPTION OF PREFERRED 
EMBODIMENTS 

This section contains a detailed discussion of mechanisms that could be used to 
implement aspects of the invention. It also contains a selection of examples of 
implementations of various aspects of the invention. However, nothing in this section 
should be interpreted as a limitation on the scope of this patent. 

Units Of Information 

Aspects of this invention can be applied at various levels of granularity of data. For 
example, if the data was treated as a stream of bits, boundaries could be placed 
between any two bits. However, if the data was treated as a stream of bytes, 
boundaries would usually be positioned only between bytes. The invention could 
be applied with any unit of data, and in this document references to bits and bytes 
should usually be interpreted as admitting any granularity. 

The Concept Of Entity 

At various places, this patent specification uses the term "entity" to describe an 
agent. This term is purposefully vague and is intended to cover all forms of agent 
including, but not limited to: 

• Computer systems. 

• Networks of computer systems. 

• Processes in computer systems. 

• File systems. 

• Components of software. 

• Dedicated computer systems. 

• Communications systems. 



The Concepts Of Identity And Reference 

This patent specification frequently refers to "identities" of subblocks and "references" 
to subblocks. These terms are not intended to be defined precisely. 

The identity of a subblock means any piece of information that could be used 
in place of the subblock for the purpose of comparison for identicality. Identities 
include, but are not limited to: 

• The subblock itself. 

• A hash of the subblock. 

The subblock acts as its own identity because subblocks themselves can be compared 
with each other. Hashes of subblocks also act as identities of subblocks because 
hashes of subblocks can be compared with each other to determine if their 
corresponding subblocks are identical. 

A reference to a subblock means any piece of information that could be used in 
practice by one entity to identify to another entity (or itself) a particularly-valued 
subblock, where the two entities may already share some kind of knowledge. 
For example, the two entities might each possess the knowledge that the other entity 
already possesses ten subblocks of known values having particular index values 
numbered one to ten. 

Once two entities have a basis of shared knowledge, it is possible for them to identify 
a subblock in ways more concise than the transmission of an identity. A reference 
to a particularly-valued subblock can take (without limitation) the following forms: 

• An identity. 

• An identifying number of a subblock possessed by the receiver. 

• An identifying number of a subblock previously transmitted between the two 
communicants. 

• The location of the block in some shared data space. 

• A relative subblock number. 

• Ranges of the above. 

The concept of knowledge of a subblock is related to the concepts of identity and 
reference. An entity may have knowledge of a subblock (or knowledge that another 
entity possesses a subblock) without actually possessing the subblock itself. For 
example, it might possess an identity of the subblock or a reference to the subblock. 

The Use Of Ranges 

In any situation where a group of values that have contiguous values (e.g. 6, 7, 8, 
9) is to be communicated or stored, such a group can be represented using a range 
(e.g. 6-9) which may take up less communication time or storage space. Ranges can 
be applied to all kinds of things, such as index values and subblock numbers. In 
particular, if an entity notices that the references (to subblocks) that it is about to 
transmit are contiguous, it can replace the references with a range. 

Ranges can be represented in any way that identifies the first and last element of 
the range. Three common representations are: 

1. The first and last element of the range. 

2. The first element and the length of the range. 

3. The last element and the length of the range. 
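
As a small illustration of the first two representations, contiguous index values can be collapsed as follows. The function names `to_ranges` and `first_and_length` are hypothetical, and the input is assumed to be a list of integers in the order they would be transmitted.

```python
def to_ranges(values):
    """Collapse runs of consecutive integers into (first, last) pairs."""
    ranges = []
    for v in values:
        if ranges and v == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], v)   # extend the current run
        else:
            ranges.append((v, v))             # start a new run
    return ranges

def first_and_length(ranges):
    """Convert (first, last) pairs to (first, length) pairs."""
    return [(first, last - first + 1) for first, last in ranges]

refs = [6, 7, 8, 9, 12, 20, 21]
as_first_last = to_ranges(refs)
as_first_len = first_and_length(as_first_last)
```

Seven references shrink to three ranges here; the saving grows with the length of each contiguous run.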

The concept of range can be generalized to include the compression of any group of 
values that exhibit compressible structure. 

The Use Of Backward References 

References can be used not only to refer to data shared by two communicants at 
the start of a transmission, but can also be used to refer to data communicated at 
some previous time during the transmission. 

For example, if an entity A notices that the subblock it is about to transmit to 
another entity B was not possessed by B at the start of the transmission, but 
has since been transmitted by A to B, then A could code the second instance of 
the subblock as a reference to the previous instance of the subblock. The range 
mechanism can be used here too. 
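
A minimal sketch of backward references within a single transmission follows. The ("data", ...) / ("backref", ...) stream items and both function names are assumptions made for illustration; a reference here is simply the position at which the subblock was first transmitted.

```python
def encode_with_backward_refs(subblocks):
    """Send each distinct subblock once; later copies point back to it."""
    first_seen = {}   # subblock contents -> position of first transmission
    stream = []
    for position, sb in enumerate(subblocks):
        if sb in first_seen:
            stream.append(("backref", first_seen[sb]))
        else:
            first_seen[sb] = position
            stream.append(("data", sb))
    return stream

def decode_with_backward_refs(stream):
    """Resolve backward references against what has already been received."""
    out = []
    for kind, value in stream:
        out.append(value if kind == "data" else out[value])
    return out

original = [b"aa", b"bb", b"aa", b"aa", b"cc", b"bb"]
stream = encode_with_backward_refs(original)
restored = decode_with_backward_refs(stream)
```

The receiver needs no prior knowledge for this to work, since every reference points at something it has already been sent.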

No Requirement For Subblock Framing Information 

It should be noted that it is possible that an entity E1 could transmit a group X of 
subblocks X1 . . . Xn as a group to an entity E2 simply by sending the concatenation 
of the subblocks. There may be no need for any framing information (e.g. information 
at the start of each subblock giving the length of the subblock or "escape" codes 
to indicate subblock boundaries) as E2 is capable of partitioning X into X1 . . . Xn 
itself. 

No Requirement For Ordering Subblocks 

It should be noted that if two entities E1 and E2 both possess the same unordered 
group Y of subblocks (or knowledge of such a group of subblocks) then even though 
E1 and E2 may not possess the subblocks in the same order, the subblocks can still 
be referred to using a subblock index or serial number. This is achieved by having 
E1 and E2 each sort their subblocks in accordance with some mutually agreed (or 
universally defined) ordering method and then number the subblocks in the resultant 
ordered group of subblocks. These numbers (or ranges of such numbers) can then 
be used to refer to subblocks. 

An Overview Of Hash Functions 

Although the use of a hash function is not essential in all aspects of this invention, 
hash functions provide such advantages in the implementation of this invention that 
an overview of them is warranted. 

A hash function accepts a variable-length input block of bits and generates an output 
block of bits that is based on the input block. Most hash functions guarantee that 
the output block will be of a particular length (e.g. 16 bits) and aspire to provide 
a random, but deterministic, mapping between the infinite set of input blocks and 
the finite set of output blocks. The property of randomness enables these outputs, 
called "hashes", to act as easily manipulated representatives of the original block. 

Hash functions come in at least four classes of strength. 

Narrow hash functions: Narrow hash functions are the weakest class 
of hash functions and generate output values that are so narrow (e.g. 16 
bits) that the entire space of output values could be searched in a reasonable 
time. For example, an 8-bit hash function would map any data 
block to a hash in the range 0 to 255. A 16-bit hash function would 
map to a hash in the range 0 to 65535. Given a particular hash value, 
it would be possible to find a corresponding block simply by generating 
random blocks and feeding them into the narrow hash function until 
the searched-for value appeared. Narrow hash functions are usually used 
to arbitrarily (but deterministically) classify a set of data values into a 
small number of groups. As such, they are useful for constructing hash 
table data structures, and for detecting errors in data transmitted over 
noisy communication channels. Examples of this class: CRC-16, CRC-32, 
Fletcher checksum, the IP checksum. 

Wide hash functions: Wide hash functions are similar to narrow hash 
functions except that their output values are significantly wider. At a 
certain point this quantitative difference implies a qualitative difference. 
In a wide hash function, the output value is so wide (e.g. 128 bits) that the 
probability of any two randomly chosen blocks having the same hashed 
value is negligible (e.g. about one in 10^38). This property enables these 
wide hashes to be used as "identities" of the blocks of data from which 
they are calculated. For example, if entity E1 has a block of data and 
sends the wide hash of the block to an entity E2, then if entity E2 has a 
block that has the same hash, then the a-priori probability of the blocks 
actually being different is negligible. The only catch is that wide hash 
functions are not designed to be non-invertible. Thus, while the space of 
(say) 2^128 values is too large to search in the manner described for narrow 
hash functions, it may be easy to analyse the hash function and calculate 
a block corresponding to a particular hash. Accordingly, E1 could fool 
E2 into thinking E1 had one block when it really had a different block. 
Examples of this class: any 128-bit CRC algorithm. 

Weak one-way hash functions: Weak one-way hash functions are not 
only wide enough to provide "identity", but they also provide cryptographic 
assurance that it will be extremely difficult, given a particular 
hash value, to find a block corresponding to that hash value. Examples 
of this class: a 64-bit DES hash. 

Strong one-way hash functions: Strong one-way hash functions are 
the same as weak one-way hash functions except that they have the additional 
property of providing cryptographic assurance that it is difficult 
to find any two different blocks that have the same hash value, where the 
hash value is unspecified. Examples of this class: MD4, MD5, SHA-1, 
and Snefru. 

These four classes of hash provide a range of hashing strengths from which to choose. 
As might be expected, the speed of a hash function decreases with strength, providing 
a tradeoff, and different strengths are appropriate in different applications. 
However, the difference is small enough to admit the use of strong one-way hash 
functions in all but the most time-critical applications. 
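
The weakness of the narrow class can be made concrete with a short sketch. It uses Fletcher-16 as the narrow hash (one of the examples named above) and exhaustively searches two-byte blocks for one with a given hash value; the exhaustive search (rather than random generation) and the function names are choices made for this illustration. SHA-256 appears only to contrast output widths and is an assumed stand-in for a strong one-way hash.

```python
import hashlib

def fletcher16(block: bytes) -> int:
    """Fletcher-16 checksum, a narrow (16-bit) hash."""
    sum1 = sum2 = 0
    for byte in block:
        sum1 = (sum1 + byte) % 255
        sum2 = (sum2 + sum1) % 255
    return (sum2 << 8) | sum1

def find_block_with_hash(target: int):
    """Search all two-byte blocks for one whose narrow hash equals target."""
    for candidate in range(1 << 16):
        block = candidate.to_bytes(2, "big")
        if fletcher16(block) == target:
            return block
    return None

original = b"a subblock that an intruder wants to impersonate"
target = fletcher16(original)
forged = find_block_with_hash(target)      # found after at most 65536 tries

strong_bits = hashlib.sha256(original).digest_size * 8
```

At most 65536 candidates need to be tried for the 16-bit hash, whereas the same search against a 256-bit output is not feasible.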

The term cryptographic hash is often used to refer to hashes that provide cryp- 
tographic strength, encompassing both the class of weak one-way hash functions 
and the class of strong one-way hash functions. However, as strong one-way hash 
functions are almost always preferable to weak one-way hash functions, the term 
"cryptographic hash" is used mainly to refer to the class of strong one-way hash 
functions. 

The present invention can employ hash functions in at least two roles: 

1. To determine subblock boundaries. 

2. To generate subblock identities. 

Depending on the application, hash functions from any of the four classes above 
could be employed in either role. However, as the determination of subblock boundaries 
does not require identity or cryptographic strength, it would be inefficient to 
use hash functions from any but the weakest class. Similarly, the need for identity, 
the ever-present threat of subversion, and the minor performance penalty for 
strong one-way hash functions suggest that nothing less than strong one-way hash 
functions should be used to calculate subblock identities. 

The security dangers inherent in employing anything less than a strong one-way hash 
function to generate identities can be illustrated by considering a communications 
system or file system that incorporates the invention using any such weaker hash 
function. In such a system, an intruder could modify a subblock (to be manipulated 
by a target system) in such a way that the modified subblock has the same hash as 
another subblock known by the intruder to be already present in the target system. 
This could result in the target system retaining its existing subblock rather than 
replacing it by a new one. Such a weakness could be used (for example) to prevent 
a target system from properly applying a security patch retrieved over a network. 

Thus, while wide hash functions could be safely used to calculate subblock identities in 
systems not exposed to hostile humans, even weak one-way hash functions are likely to 
be insecure in those systems that are. 

We now turn to the ways in which hashes of blocks or subblocks can actually be 
used. 

The Use Of Cryptographic Hashes 

The theoretical properties of cryptographic hashes (and here is meant strong one-way 
hash functions) yield particularly interesting practical properties. Because such 
hashes are significantly wide, the probability of two randomly-chosen subblocks having 
the same hash is practically zero (for a 128-bit hash, it is about one in 10^38), 
and because it is computationally infeasible to find two subblocks having the same 
hash, it is practically guaranteed that no intelligent agent will be able to do so. The 
implication of these properties is that from a practical perspective, the finite set 
of hash values for a particular cryptographic hash algorithm is one-to-one with the 
infinite set of finite variable length subblocks. This theoretically impossible property 
manifests itself in practice because of the practical infeasibility of finding two 
subblocks that hash to the same value. 

This property means that, for the purposes of comparison (for identicality), cryptographic 
hashes may safely be used in place of the subblocks from which they were 
calculated. As most cryptographic hashes are only about 128 bits long, hashes provide 
an extremely efficient way to compare subblocks without requiring the direct 
comparison of the content of the subblocks themselves. This can be used to eliminate 
many transmissions of information. For example, a subblock X1 on a computer 
C1 in Sydney could be compared with a subblock Y1 on a computer C2 in Boston 
by a computer C3 in Paris, with the total theoretical network traffic being just 256 
bits (C1 and C2 each send the 128-bit hash of their respective subblocks to C3 for 
comparison, and C3 compares the two hashes). 

Some of the ways in which cryptographic hashes could be used in aspects of this 
invention are: 

• Cryptographic hashes can be used to compare two subblocks without having 
to compare, or requiring access to, the content of the subblocks. 

• If it is necessary to be able to determine whether a subblock T is identical 
to one of a group of subblocks, the subblocks themselves need not be stored, 
just a list of their hashes. The hash of any candidate subblock can then be 
compared with the hashes in the list to establish whether the subblock is in 
the group of subblocks from which the list of hashes was generated. 

• Cryptographic hashes can be used to ensure that the partitioning of a block 
into subblocks and the subsequent reassembly of the subblocks into a recon- 
structed block is error-free. This can be done by comparing the hash of the 
original block with the hash of the reconstructed block. 

• If an entity E1 calculates the hash of a subblock X1 and transmits it to E2, 
then if E2 possesses X1, or even just the hash of X1, then E2 can determine 
without any practical doubt that E1 possesses X1. 

• If an entity E1 passes a key (consisting of a block of bits) chosen at random 
to an entity E2, E2 may then prove to E1 that it possesses a subblock by 
sending E1 the hash of the concatenation of the key and the subblock. This 
mechanism could be used as an additional check in security applications. 

• If a group of subblocks must be compared so as to find all subsets of identical 
subblocks, the corresponding set of hashes of the subblocks may be calculated 
and compared instead. 

• Many of the uses of cryptographic hashes for subblocks can also be applied to 
blocks. For example, cryptographic hashes can be used to determine whether 
a block has changed at all since it was last backed up. Such a check could 
eliminate the need for further analysis. 
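
The keyed proof of possession mentioned in the list above can be sketched as follows. SHA-256 and a 16-byte key are assumptions, as are the function names. The sketch follows the text literally by hashing the concatenation of key and subblock; a production design might prefer a dedicated keyed construction such as HMAC.

```python
import hashlib
import secrets

def prove_possession(key: bytes, subblock: bytes) -> bytes:
    """E2's response: the hash of the key concatenated with the subblock."""
    return hashlib.sha256(key + subblock).digest()

def verify_possession(key: bytes, subblock: bytes, proof: bytes) -> bool:
    """E1's check, recomputed from its own copy of the subblock."""
    expected = prove_possession(key, subblock)
    return secrets.compare_digest(expected, proof)

key = secrets.token_bytes(16)          # chosen at random by E1 for each challenge
subblock = b"contents both entities claim to hold"
proof = prove_possession(key, subblock)
accepted = verify_possession(key, subblock, proof)
rejected = verify_possession(key, b"some other subblock", proof)
```

Because the key is fresh for each challenge, a party that only knows the plain hash of the subblock cannot produce the proof.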

Use Of Hashes As A Safety Net 

A potential disadvantage of deploying aspects of this invention is that it will add 
extra complexity to the systems into which it is incorporated. This increased com- 
plexity carries the potential to increase the chance of undetected failures. 

The main mechanism of complexity introduced by many aspects of the invention is 
the partitioning of blocks (e.g. files) into subblocks, and the subsequent re-assembly 
of such subblocks. By partitioning a block into subblocks, a system creates the 
potential for subblocks to be erroneously added, deleted, rearranged, substituted, 
duplicated, or in some other way exposed to a greater risk of accidental error. 

This risk can be reduced or eliminated by calculating the hash (preferably a cryptographic 
hash) of the block before it is partitioned into subblocks, storing the hash 
with an entity associated with the block as a whole and then later comparing the 
stored hash with a computed hash of the reconstructed block. Such a check 
would provide a very strong safety net that would virtually eliminate the risk of 
undetected errors arising from the use of this invention. 

Choosing A Partitioning Function 

Although the requirements for the block partitioning function F are not stringent, 
care should be taken to select a function that suits the application to which it is to 
be applied. 

In situations where the data is highly structured and knowledge of the data is 
available, a choice of an F that tends to place subblock boundaries at positions in 
the data that correspond to obvious boundaries in the data could be advantageous. 
However, in general, F should be chosen from the class of narrow hash functions. 
Use of a narrow hash function for F provides both efficiency and a (deterministic) 
randomness that will enable the implementation to operate effectively over a wide 
range of data. 

One of the most important properties of F is the probability that F will place a 
boundary at any particular point when applied to completely random data. For 
example, a function with a probability of one would produce a boundary between 
each bit (or byte), whereas a function with a probability of zero would never produce 
any boundaries at all. In a real application, a more moderate probability would be 
chosen (e.g. 1/1024) so as to yield useful subblock sizes. The probability can be 
tuned to suit the application. 
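
The link between the boundary probability and subblock size can be made explicit under an idealising assumption: that on random data the boundary decisions at successive positions are independent, each with probability q (the symbol q is introduced here for illustration only). The subblock length L is then geometrically distributed:

```latex
P(L = k) = (1 - q)^{k-1}\, q, \qquad
\mathbb{E}[L] = \frac{1}{q}, \qquad
P(L > k) = (1 - q)^{k}.
% With q = 1/1024:
\mathbb{E}[L] = 1024, \qquad
P(L > 4096) = \left(1 - \tfrac{1}{1024}\right)^{4096} \approx e^{-4} \approx 0.018.
```

Under this model a probability of 1/1024 gives a mean subblock size of 1024 units, with roughly 2% of subblocks exceeding four times that size.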

We end this section with an example of a narrow hash function that has been 
implemented and tested and seems to perform well on a variety of data types. The 
hash function calculates a hash value from three bytes. 



\[ H(b_1, b_2, b_3) = ((40543 \times ((b_1 \ll 8) \oplus (b_2 \ll 4) \oplus b_3)) \gg 4) \mid p \]

The following notation has been used: $\times$ is multiplication, $\ll$ is left bit shift, 
$\gg$ is right bit shift, $\oplus$ is exclusive or, and $\mid$ is modulo. The constant p is 
the inverse of the probability of placing a boundary at an arbitrary position in a 
randomly generated block of data, and can be set to any integer value in [0, 65535]. 
However, in practice it seems to be advantageous to choose values that are prime 
(Mersenne primes seem to work well). The value 40543 was chosen carefully in 
accordance with the guidelines provided in pages 508-513 of the book: 

Knuth D.E., "The Art of Computer Programming: Volume 3: Sorting 
and Searching", Addison Wesley, 1973. 

The function evaluates to a value in the range [0, p - 1] and can be used in practice 
by placing a boundary at each point where the preceding three bytes hash to a 
predetermined constant value V. This would imply that its arguments b1 . . . b3 
correspond to the argument A in aspect one above. To avoid pathological behaviour 
in the commonly occurring case of runs of zeros, it is wise to choose a non-zero value 
for V. 

In a real implementation, p was set to 511 and V was set to one. 
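
A minimal sketch of how such a function might drive partitioning is given below, using the stated values p = 511 and V = 1. The function names, the convention that a boundary at position i falls immediately before byte i, and the seeded random test data are assumptions made for illustration; this is not presented as the implementation referred to in the text.

```python
import random

def three_byte_hash(b1: int, b2: int, b3: int, p: int = 511) -> int:
    """The three-byte narrow hash from the text, with p as the modulus."""
    return ((40543 * ((b1 << 8) ^ (b2 << 4) ^ b3)) >> 4) % p

def boundaries(block: bytes, p: int = 511, v: int = 1) -> list:
    """Interior positions i where the three bytes before i hash to v."""
    return [i for i in range(3, len(block))
            if three_byte_hash(block[i - 3], block[i - 2], block[i - 1], p) == v]

def partition(block: bytes, p: int = 511, v: int = 1) -> list:
    """Split block at every boundary position."""
    cuts = [0] + boundaries(block, p, v) + [len(block)]
    return [block[a:b] for a, b in zip(cuts, cuts[1:])]

rng = random.Random(0)                       # fixed seed, illustration only
data = bytes(rng.getrandbits(8) for _ in range(20000))
old_parts = partition(data)
new_parts = partition(b"!" + data)           # one byte inserted at the front
shared = [part for part in old_parts[1:] if part in new_parts]
```

Because each boundary depends only on the three preceding bytes, inserting a byte at the front disturbs only the first subblock; every later subblock of the original reappears unchanged in the new partition.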

Although early aspects of the detailed description of the invention refer to a function 
F that places a boundary when its output value falls within a predetermined subclass 
of the set of possible output values, it should be noted that the combination of F 
and its use in the invention can always be viewed as equivalent to a boolean function 
B(x, y, z) (where x and y are blocks of data) where 

\[ B(x, y, z) = G(F(x, y, z)) \]

and where G is a function accepting a value of whatever type F returns and returning 
a boolean, being true iff the value from F falls within a predetermined subclass 
defined for F. The z arguments have been included to indicate that the functions 
could depend on other information too. 

Placing An Upper And Lower Bound On The Subblock Size 

The use of data-dependent subblock boundaries provides a way to deterministically 
partition similar portions of data in a context-independent way. However, if artifi- 
cial bounds are not placed on the subblock size, particular kinds of data will yield 
subblocks that are either too large or too small to be effective. For example, if a
file contains a block of a million identical bytes, any deterministic function F (that 
operates at the byte level) must either partition the block into one subblock or a 
million subblocks. Both alternatives are undesirable. 

A solution to this problem is to artificially impose an upper bound U and a lower
bound L on the subblock size. There seem to be a limitless number of ways of doing
this. Here are some examples:

Upper bound: Subdivide subblocks defined by F that are longer than
U bytes at the points U, 2U, 3U, and so on, where U is the chosen
upper bound on subblock size.

Upper bound: Subdivide subblocks that are longer than U bytes at
points determined by a secondary hash function.

Lower bound: Of the set of boundaries that bound subblocks less than
L bytes long, remove those boundaries that are closer to their neighbouring
boundaries than their neighbouring boundaries are to their
neighbouring boundaries.

Lower bound: If the block is being scanned sequentially, do not place
a boundary unless at least L bytes have been scanned since the previous
boundary.

Lower bound: Of the set of boundaries that bound subblocks less than
L bytes long, remove those boundaries that satisfy some secondary hash
function.

Lower bound: Of the set of boundaries that bound subblocks less than
L bytes long, remove randomly chosen boundaries until all the resulting
subblocks are at least L bytes long.

Many other such schemes could be devised. 
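
Two of the schemes listed above can be sketched as transformations on a sorted list of boundary offsets. This is a minimal illustration, assuming each boundary is an interior byte offset (0 < offset < block length); the function names are hypothetical. The first function applies the sequential-scan lower bound, the second subdivides at U, 2U, 3U and so on.

```python
def enforce_lower_bound(boundaries: list[int], lower: int) -> list[int]:
    """Drop any boundary that falls fewer than `lower` bytes after the
    previously kept boundary (the sequential-scan scheme)."""
    kept, previous = [], 0
    for pos in boundaries:
        if pos - previous >= lower:
            kept.append(pos)
            previous = pos
    return kept


def enforce_upper_bound(boundaries: list[int], block_len: int, upper: int) -> list[int]:
    """Subdivide any subblock longer than `upper` bytes at offsets
    upper, 2*upper, 3*upper, ... measured from the subblock's start."""
    result, start = [], 0
    for end in boundaries + [block_len]:
        pos = start + upper
        while pos < end:
            result.append(pos)
            pos += upper
        if end < block_len:
            result.append(end)  # keep the original data-dependent boundary
        start = end
    return result
```

Note that subdividing at fixed multiples can leave a short final piece, so a scheme that combines both bounds has to decide which bound takes precedence.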

Partitioning Blocks Using Non Data-Dependent Means 

Although this invention will usually be applied using data-dependent, variable-length
subblocks, it can also be applied using non data-dependent subblocks. For
example, the input blocks could simply be partitioned into n-byte blocks. Non
data-dependent partitionings could be very effectively applied to blocks whose content
varies but does not move about (i.e. the bits or bytes of the data are modified, but
bits or bytes are not inserted, deleted, or re-arranged).

Another way of using fixed-length blocks would be to use many different overlapping 
partitionings of fixed-length blocks. 

There are an infinite number of other ways of partitioning a block into subblocks 
without referring to its content. 
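
As an illustration of non data-dependent partitioning, the sketch below splits a block into n-byte subblocks and also produces several overlapping fixed-length partitionings by shifting the first boundary. The function names and the `offset` and `step` parameters are assumptions made for this example.

```python
def fixed_partition(block: bytes, n: int, offset: int = 0) -> list[bytes]:
    """Split `block` into n-byte subblocks, with the first boundary at
    `offset` (or at n when offset is 0). End pieces may be shorter."""
    first = offset if offset > 0 else n
    edges = [0, *range(first, len(block), n), len(block)]
    return [block[a:b] for a, b in zip(edges, edges[1:]) if a < b]


def overlapping_partitionings(block: bytes, n: int, step: int) -> list[list[bytes]]:
    """Several fixed-length partitionings whose boundaries are shifted by `step`."""
    return [fixed_partition(block, n, offset) for offset in range(0, n, step)]
```

If a byte is modified in place, only the subblock containing it changes; if a byte is inserted, every later fixed-length subblock shifts, which is why this approach suits data that does not move about.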

The Use Of Multiple Partitionings

In most applications the use of just one partitioning into subblocks will be sufficient.
However, in some applications there may be a need for more than one subblock
partitioning. For example, in applications where channel space is expensive, it may
be appropriate to partition each block of data in W different ways using W different
functions F1 ... FW where each function provides a different average subblock
size. For example, four different partitions could be performed using functions that
provide subblocks of average length 256 bytes, 1K, 10K, and 100K. By providing
a range of different sizes of subblocks to choose from, such an organization could
simultaneously indicate large blocks extremely efficiently, while still retaining fine-grained
subblocks so that minor changes to the data do not result in voluminous
updates (Figure 8).

The efficiency of such a scheme could be improved by performing the partitioning 
all in one operation using increasing constraints on a single F. For example, the 
example hash function that was described earlier could be used, but with different 
values of the constant p being used to determine the different levels of subdivision. 
By choosing appropriately related values of p, the set of boundaries that could be 
produced by the different F could be arranged to be subsets of each other, resulting 
in a tree structure of subblocks. For example, values of p of 32, 64, 128, and
256 could be used. Figure 9 shows how the subblocks of four levels of the tree could
relate to each other.
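
One way of obtaining nested boundary sets can be sketched as follows. The sketch assumes the example hash is ((40543 × ((b1 << 8) ⊕ (b2 << 4) ⊕ b3)) >> 4) modulo p, that p takes the power-of-two values 32, 64, 128, and 256, and that a single constant V smaller than the smallest p is used at every level. Under those assumptions, a value equal to V modulo 256 is also equal to V modulo 128, 64, and 32, so each coarser level is a subset of the finer one. The function names are hypothetical.

```python
def raw_hash(b1: int, b2: int, b3: int) -> int:
    """The example hash before the final modulo-p step."""
    return (40543 * ((b1 << 8) ^ (b2 << 4) ^ b3)) >> 4


def boundaries(block: bytes, p: int, v: int = 1) -> set[int]:
    """Offsets i at which the three bytes block[i-3:i] hash to v modulo p."""
    return {
        i
        for i in range(3, len(block))
        if raw_hash(block[i - 3], block[i - 2], block[i - 1]) % p == v
    }


def boundary_levels(block: bytes, ps=(32, 64, 128, 256), v: int = 1) -> dict[int, set[int]]:
    """Boundary sets for each level; larger p gives a coarser level."""
    return {p: boundaries(block, p, v) for p in ps}
```

Because the sets are nested, every subblock at one level is a concatenation of whole subblocks from the level below, which gives the tree structure.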

A further method could define the hash of a larger block to be the hash of the hashes 
of its component blocks. 

Multiple partitionings may also be useful simply to provide a wider pool of subblocks 
to recognize. For example, it may be appropriate to partition each block of data in 
W different ways using W different functions F1 ... FW where each function yields
roughly the same subblock sizes, but at different positions in the block. 

Another technique would be to create an additional set of boundaries based on the 
boundaries provided by a hash function. For example, a fractal algorithm could be
used to partition a block based upon some other partitioning provided by a function 
F. 

Comparing Subblocks 

In most applications of this invention, there will be a need at some stage to identify
identical subblocks. This can be done in a number of ways:

• Compare the subblocks themselves.

• Compare the hashes of the subblocks. 

• Compare identities of the subblocks. 

• Compare references to the subblocks. 

In most cases, the problem reduces to that of taking a group of subblocks of data 
and finding all subsets of identical subblocks. This is a well-solved problem and 
discussion of various solutions can be found in the following books: 

Knuth D.E., "The Art of Computer Programming: Volume 1: Fundamental
Algorithms", Addison Wesley, 1973.

Knuth D.E., "The Art of Computer Programming: Volume 3: Sorting 
and Searching", Addison Wesley, 1973. 

In most cases, the problem is best solved by creating a data structure that maintains
the subblocks, or references to the subblocks, in sorted order, and then inserting
each subblock one at a time into the data structure. Not only does this identify all
currently identical subblocks, but it also establishes a structure that can be used to
determine quickly whether incoming subblocks are identical to any of those already
held. The following data structures are described in the books referenced above and
provide just a sample of the structures that could be used:

• Hash tables. 

• Sorted trees (binary, N-ary, AVL).

• Sorted linked lists. 

• Sorted arrays.

Of the multitude of solutions to the problem of matching blocks of data, one solution
is worthy of special attention: the hash table. Hash tables consist of a (usually) finite
array of slots into which values may be inserted. To add a value to a hash table,
the value is hashed (using a hash function that is usually selected from the class of
narrow hash functions) into a slot number and the value is inserted into that slot.
Later, the value can be retrieved in the same manner. Provisions must be made for
the case where two data values to be stored in the same table hash to the same slot
number.

Hash tables are likely to be of particular value in the implementation of this invention 
because: 

• They provide very fast (essentially constant time) access. 

• Many applications will need to calculate a strong one-way hash of each subblock
anyway, and a portion of this value can be used to index the hash table.

Particularly effective would be a hash table indexed by a portion of a strong one-way
hash of the subblocks it stores, with each table entry containing (a) the strong
one-way hash of the subblock, and (b) a pointer to the subblock stored elsewhere in
memory.
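
The table just described might be sketched as below. This is an illustration under stated assumptions: MD5 stands in for the strong one-way hash, the low-order bits of the digest select the slot, colliding entries are chained in a list per slot, and an index into a Python list stands in for the pointer to the subblock. The class and method names are hypothetical.

```python
import hashlib


class SubblockTable:
    """Hash table indexed by the bottom `bits` bits of each subblock's digest."""

    def __init__(self, bits: int = 16):
        self.mask = (1 << bits) - 1
        self.slots = [[] for _ in range(1 << bits)]  # each slot: list of (digest, index)
        self.store = []  # stands in for subblocks held elsewhere in memory

    def _slot(self, digest: bytes) -> list:
        return self.slots[int.from_bytes(digest, "big") & self.mask]

    def insert(self, subblock: bytes) -> tuple[int, bool]:
        """Return (index into store, True if the subblock was already held)."""
        digest = hashlib.md5(subblock).digest()
        slot = self._slot(digest)
        for held_digest, index in slot:
            if held_digest == digest:  # compare full digests, not the data
                return index, True
        self.store.append(subblock)
        slot.append((digest, len(self.store) - 1))
        return len(self.store) - 1, False
```

Only digests are compared on lookup, so identical subblocks are recognized without touching the stored data.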

The Use Of Compression, Encryption, and Integrity Techniques

Various aspects of the invention could be enhanced by the use of data compression,
data encryption, and data integrity techniques. The applications of these techniques
include, but are not limited to, the following applications:

• Any subblock that is transmitted or represented in its raw form could alternatively
be transmitted or represented in a compressed or encrypted form.

• Subblocks could be compressed and encrypted before further processing by
aspects of this invention.

• Blocks could be compressed and encrypted before further processing by aspects 
of this invention. 

• Communications or representations could be compressed or encrypted. 

• Any component could carry additional checking information such as checksums
or digests of the data in the component. 

• Ad-hoc data compression techniques could be used to further compress refer- 
ences and identities or consecutive runs of references and identities. 

Storage Of Variable-Length Subblocks On Disk 

The division of data into subblocks of varying length presents some storage orga- 
nization problems if the subblocks are to be stored independently of each other, as 
most hardware disk systems are organized to store an array of fixed-length blocks 
(e.g. one million 512-byte blocks) rather than variable-length ones. Here are some
techniques that could be used to tackle this problem: 

• Each subblock could be stored in an integral number of disk blocks, with some
part of the last disk block being wasted. For randomly sized subblocks, this
scheme will waste on average half a disk block per subblock.

• Create a small subset of different bucket sizes (e.g. powers of two) and create 
arrays on the disk that pack collections of these buckets efficiently into the 
disk blocks. For example, if disk blocks were 512 bytes long, one could fairly
efficiently pack five 200-byte buckets into an array of two disk blocks. Each 
subblock would be stored in the smallest bucket size that would hold the 
subblock, with the unused part of the bucket being wasted.

• Treat the disk blocks as a vast array of bytes and use well-established heap
management techniques to manage the array. A sample of such techniques 
appears in pages 435-451 of the book: 

Knuth D.E., "The Art of Computer Programming: Volume 1: Fundamental
Algorithms", Addison Wesley, 1973.
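
The arithmetic behind the first two techniques can be sketched as follows. The function names, the 64-byte smallest bucket, and the power-of-two bucket series are assumptions chosen for illustration.

```python
def disk_blocks_needed(length: int, disk_block: int = 512) -> int:
    """Whole disk blocks needed to store `length` bytes (first technique)."""
    return -(-length // disk_block)  # ceiling division


def bucket_size(length: int, smallest: int = 64) -> int:
    """Smallest power-of-two bucket (at least `smallest`) that holds `length` bytes."""
    size = smallest
    while size < length:
        size *= 2
    return size


def wasted_bytes(length: int, disk_block: int = 512) -> int:
    """Space lost in the last disk block under the first technique."""
    return disk_blocks_needed(length, disk_block) * disk_block - length
```

For the packing example in the text, five 200-byte buckets occupy 1000 of the 1024 bytes in two 512-byte disk blocks, leaving 24 bytes unused.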

The Use Of Concurrency 

Two processes are said to be concurrent if their execution takes place in some sense 
at the same time: 

• In interleaving concurrency, some or all of the operations performed by the 
two processes are interleaved in time, but the two processes are never both 
executing at exactly the same instant. 

• In genuine concurrency, some or all of the operations performed by the two 
processes are genuinely executed at the same instant. 

Implementations of the present invention could incorporate either form of concurrency
to various degrees. In most of the aspects described earlier, some subset of
the steps of each aspect could be performed concurrently. In particular (without 
limitation): 

• A block could be split into parts and each part partitioned concurrently.

• The processing of subblocks defined during a sequential partitioning of a block 
need not be deferred until the entire block has been partitioned. In particular, 
the hashes of already-defined subblocks could be calculated and compared 
while further subblocks are being defined.

• Communicating entities implementing aspects that decompose and compose
blocks could execute concurrently. 

• Where more than one block must be partitioned for processing, such parti- 
tioning can occur concurrently. 

Many more forms of concurrency within aspects of this invention could be identified. 

Example: Partitioning A Block 

We now present a simple example of how a block might be partitioned in practice. 
Consider the following block of bytes: 

b1 b2 b3 b4 b5 b6 b7 b8 b9 ...

In this example, the example hash function H will be used to partition the block
and boundaries will be represented by pairs such as b6/b7. We will assume that H
returns a boolean value based on its argument and that a boundary is to be placed
at each b_i/b_{i+1} for which H(b_{i-2}, b_{i-1}, b_i) evaluates to true.

As the hash function accepts 3 byte arguments, we start at b3/b4 and evaluate
H(b1, b2, b3). This turns out to be false, so we move to b4/b5 and evaluate
H(b2, b3, b4). This turns out to be true, so a boundary is placed at b4/b5. Next, we
move to b5/b6, and evaluate H(b3, b4, b5). This turns out to be false, so we move on.
H(b4, b5, b6) is true, so we place a boundary at b6/b7. This process continues until
the end of the block is reached.

b1 b2 b3 b4 | b5 b6 | b7 b8 b9 ...

Some variations on this approach are:






• Imposition of a lower bound L on subblock size by skipping ahead L bytes 
following the placement of each boundary. 

• Imposition of an upper bound U on block size by artificially placing a boundary 
if U bytes have been processed since the last boundary was placed. 

• Improving the efficiency of the hash calculations by using some part of the 
calculation of the hash of the bytes at one position to calculate the hash at the 
next position. For example, it may be more efficient to calculate H(x, y, z) if
H(w, x, y) has already been calculated. For example, the Internet IP checksum
is organized so that a single running checksum value can be maintained, with 
bytes entering the window being added to the checksum and bytes exiting the 
window being subtracted from the checksum. 

• Applying this algorithm in reverse starting from the end of the block and 
working backwards. 

• Establishing the subblock enclosing a particular point (chosen from anywhere 
within the block) by exploring in both directions from the point looking for 
the nearest boundary in each direction. 

• Finding all subblock boundaries in one step by evaluating F for all positions
in parallel. 
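
The sequential scan of the worked example, together with the lower-bound and upper-bound variations, can be sketched as below. This is a minimal illustration rather than the implementation referred to in the text: it assumes the example hash is ((40543 × ((b1 << 8) ⊕ (b2 << 4) ⊕ b3)) >> 4) modulo p with a boundary wherever the result equals V, and the names `partition`, `lower`, and `upper` are hypothetical.

```python
def narrow_hash(b1: int, b2: int, b3: int, p: int = 511) -> int:
    return ((40543 * ((b1 << 8) ^ (b2 << 4) ^ b3)) >> 4) % p


def partition(block: bytes, p: int = 511, v: int = 1,
              lower: int = 0, upper=None) -> list[bytes]:
    """Scan left to right, cutting after byte i-1 when the three bytes
    block[i-3:i] hash to v. `lower` suppresses boundaries that would make a
    subblock shorter than L; `upper` forces a boundary after U bytes."""
    subblocks, start = [], 0
    for i in range(1, len(block)):
        length = i - start
        forced = upper is not None and length >= upper
        hashed = (
            i >= 3
            and length >= lower
            and narrow_hash(block[i - 3], block[i - 2], block[i - 1], p) == v
        )
        if forced or hashed:
            subblocks.append(block[start:i])
            start = i
    if start < len(block):
        subblocks.append(block[start:])  # the final subblock may be short
    return subblocks
```

Concatenating the returned subblocks always reproduces the original block, and the hash window is allowed to straddle an earlier boundary, as in the worked example.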

Example: Forming A Table of Hashes 

Once a block has been partitioned, the hash of each subblock can be calculated to 
form a table of hashes (Figure 24). 

This table of hashes can be used to determine if a new subblock is identical to any 
of the subblocks whose hashes are in the table. To do this, the new subblock's hash
is calculated and a check made to see if the hash is in the table.

In Figure 24, the table of hashes looks like an array of hashes. However, the table of
hashes could be stored in a wide variety of data structures (e.g. hash tables, binary 
trees). 

Example Application: A File Comparison Utility 

As the invention provides a new way of finding similarities between large volumes 
of data, it follows that it should find some application in the comparison of data.

In one aspect, the invention could be used to determine the broad similarities be- 
tween two files being compared by a file comparison utility. The utility would 
partition each of the two files into subblocks, organize the hashes of the subblocks
somehow (e.g. using a hash table) to identify all identical subblocks, and then use
this information as a framework for reporting similarities and differences between 
the two files. 
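
A comparison utility along these lines might be sketched as follows. The sketch assumes both files have already been partitioned into subblocks by some data-dependent method, uses MD5 as the hash, and uses a Python dictionary as the table; the function names and the report format are hypothetical.

```python
import hashlib


def index_by_hash(subblocks: list[bytes]) -> dict[bytes, list[int]]:
    """Map each subblock's digest to the byte offsets at which it occurs."""
    table, offset = {}, 0
    for sub in subblocks:
        table.setdefault(hashlib.md5(sub).digest(), []).append(offset)
        offset += len(sub)
    return table


def compare(subblocks_a: list[bytes], subblocks_b: list[bytes]) -> list[tuple]:
    """For each subblock of B, report (offset in A or None, offset in B, length)."""
    table_a = index_by_hash(subblocks_a)
    report, offset = [], 0
    for sub in subblocks_b:
        hits = table_a.get(hashlib.md5(sub).digest())
        report.append((hits[0] if hits else None, offset, len(sub)))
        offset += len(sub)
    return report
```

Entries whose first field is None mark spans found only in the second file; the remaining entries give a framework of matching spans around which differences can be reported.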

In a similar aspect, the invention could be used to find similarities between the contents
of large numbers of files in a file system. A utility incorporating the invention
could read each file in an entire file system, partition each into subblocks and then
insert the subblocks (or hashes of the subblocks) into one huge table (e.g. implemented
by a hash table or a binary tree). If each entry in the table carried the name
of the file containing it as well as the position of the subblock within the file, the
table could later be used to identify those files containing identical portions of data.

If, in addition, a facility was added for recording and comparing the hashes of the
entire contents of files and directory trees, a utility could be constructed that could
identify all largely similar structures within a file system. Such a utility would be
immensely useful when (say) attempting to merge the data on several similar backup
tapes.

Example Application: A Fine-Grained Incremental Backup System

In a fine-grained incremental backup system, two entities E1 and E2 (e.g. two
computers on a network) wish to repeatedly backup a file X at E1 such that the
old version of the file Y at E2 will be updated to become a copy of the new version
of the file X at E1 (without modifying X). The system could work as follows:

Each time E1 performs a backup operation, it partitions X into subblocks and writes
the hashes of the subblocks to a shadow file S. It might also write a hash of the
entire contents of X to the shadow file. After the backup has been completed, X
will be the same as Y and so the shadow file S will correspond to both X and Y.
Once X is again modified (during the normal operation of the computer system), S
will correspond only to Y. S is used during the next backup operation.

To perform the backup, E1 compares the hash of Y (stored in S) against the hash
of X to see if X has changed (it could also use the modification date file attribute
of the file). If X hasn't changed, there is no need to perform any further backup
action. If X has changed, E1 partitions X into subblocks and compares the hashes
of these subblocks with the hashes in the shadow file S, so as to find all identical
hashes. Identical hashes identify identical subblocks in Y that can be transmitted
by reference. E1 then transmits the file as a mixture of raw subblocks and references
to subblocks whose hashes appear in S and which are therefore known to appear as
subblocks in Y. E1 can also transmit references to subblocks already transmitted.
References can take many forms including (without limitation):

• A hash of the subblock.

• The number of the subblock in the list of subblocks in Y.

• The number of a subblock previously transmitted.

• A range of any of the above.

Throughout this process E1 can be constructing the new shadow file corresponding
to X. Figure 25 illustrates the backup process.

To reconstruct X from Y and the incremental backup information being sent from
E1, E2 partitions Y into subblocks and calculates the hashes of the subblocks
(it could do this in advance during the previous backup). It then processes the
incremental backup information, copying subblocks that were transmitted raw and
looking up the references either in Y or in the part of X already reconstructed.

Because information need only flow from E1 to E2 during the backup operation,
there is no need for E1 and E2 to perform the backup operation concurrently. E1
can perform its side of the backup operation in isolation, producing an incremental
backup file that can be later processed by E2.

There is a tradeoff between 1) the approximate ratio between the size of each file
and that of its shadow, and 2) the mean subblock size. The higher the mean subblock
size (as determined by the partitioning method used (including F)), the fewer
subblocks per unit file length, and hence the shorter the shadow size per unit file
length. However, increasing mean subblock sizes implies increasing the granularity
of backups which can cause an increase in the size of the incremental backup
file. There is also a tradeoff between the shadow file size and the hash width. A
shadow file that uses 128-bit hashes will be about twice as long as one that uses
64-bit hashes. All these tradeoffs must be considered closely when constructing an
implementation.

In a real existing implementation of this backup scheme, the exact format of the
shadow file S is:

Bytes  Description

16     MD5 digest of the file Y corresponding to this shadow file.
16     MD5 digest of the first subblock in Y.
16     MD5 digest of the second subblock in Y.
...    ...
16     MD5 digest of the last subblock in Y.
16     MD5 digest of the rest of this shadow file.



The first field contains the MD5 digest (a form of cryptographic hash) of the entire
contents of Y. This is included so that it can be copied to the incremental backup
file so as to provide a check later that the incremental backup file is not being applied
to the wrong version of Y. It could also be used to determine if any change has
been made to X since the previous backup Y was taken. The first field is followed
by a list of the MD5 digests of the subblocks in Y in the order in which they appear
in Y. Finally, a digest of the contents of the shadow file (less this field) is included
at the end so as to enable the detection of any corruption of the shadow file.
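
The shadow file layout can be sketched in Python as follows. The sketch assumes the file has already been partitioned into subblocks and follows the field order given above (whole-file digest, one digest per subblock, trailing digest of everything before it); the function names are hypothetical.

```python
import hashlib


def md5(data: bytes) -> bytes:
    return hashlib.md5(data).digest()


def build_shadow(subblocks: list[bytes]) -> bytes:
    """Whole-file digest, then one digest per subblock, then a check digest."""
    body = md5(b"".join(subblocks)) + b"".join(md5(s) for s in subblocks)
    return body + md5(body)


def parse_shadow(shadow: bytes) -> tuple[bytes, list[bytes]]:
    """Return (file digest, subblock digests), raising on corruption."""
    body, check = shadow[:-16], shadow[-16:]
    if len(shadow) < 32 or len(shadow) % 16 or md5(body) != check:
        raise ValueError("corrupt shadow file")
    digests = [body[i:i + 16] for i in range(16, len(body), 16)]
    return body[:16], digests
```

A file of n subblocks therefore yields a shadow of 16 × (n + 2) bytes, which is the quantity the tradeoff between shadow size and mean subblock size refers to.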

The format of the incremental backup file is as follows: 



Bytes  Description

16     MD5 digest of Y.
16     MD5 digest of X.
       Zero or more ITEMS.
16     MD5 digest of the rest of the incremental backup file.



The first two fields of the incremental backup file contain the MD5 digest of the old
and new versions of the file. The hash of the new version X is calculated directly
from X. The hash of the old version is obtained from the first field of the shadow
file. These two values enable the remote backup entity E2 to check that:

• The backup file Y (to be updated) is identical to the one from which the
shadow file was generated.

• The reconstructed X is identical to the original X.

The two checking fields are followed by a list of items followed by a checking digest 
of the rest of the incremental backup file. 

Each item in the list of items describes one or more subblocks in the list of subblocks
that can be considered to constitute X. There are three kinds of item, and so each
item commences with a byte having a value one, two, or three to indicate the kind
of item. Here is a description of the content of each of the three kinds of item:

1. The 32-bit index of a subblock in Y. Because E2 possesses Y, it can partition
Y itself to construct the same partitioning that was used to create the shadow
file. Thus E1 doesn't need to send the hash of any subblock that is in both
X and Y. Instead, it need only send the index of the subblock in the list
of subblocks constituting Y. This list is represented by the list of hashes in
S. As 32 bits is wide enough for an index in practice, the saving gained by
communicating a 32-bit index instead of a hash is 96 bits for each such item.

2. A pair of 32-bit numbers being the index of the first and last subblock of a range
of subblocks in Y. Old and new versions of files often share large contiguous
ranges of subblocks. The use of this kind of item allows such ranges to be
represented using just 64 bits instead of a long run of instances of the first
kind of item.

3. A 32-bit value containing the number of bytes in the subblock, followed by the
raw content of the subblock. This kind of item is used if the subblock to be
transmitted is not present in Y.

In the implementation, all the values are coded in little-endian form. Big-endian 
could be used equally as well. 
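
The three kinds of item can be sketched with Python's `struct` module. This illustration covers only the ITEMS portion of the incremental backup file (the leading and trailing MD5 digests are omitted), assumes little-endian coding as stated above, and uses hypothetical function names. Consecutive indexes into Y are merged into a range item.

```python
import hashlib
import struct


def encode_items(x_subblocks, shadow_digests) -> bytes:
    """E1 side: describe X as indexes into Y's subblock list, or raw data."""
    index_of = {}
    for i, d in enumerate(shadow_digests):
        index_of.setdefault(d, i)
    out, run = bytearray(), None  # run = [first, last] of pending indexes

    def flush():
        nonlocal run
        if run is None:
            return
        if run[0] == run[1]:
            out.extend(struct.pack("<BI", 1, run[0]))           # kind 1
        else:
            out.extend(struct.pack("<BII", 2, run[0], run[1]))  # kind 2
        run = None

    for sub in x_subblocks:
        i = index_of.get(hashlib.md5(sub).digest())
        if i is None:
            flush()
            out.extend(struct.pack("<BI", 3, len(sub)) + sub)   # kind 3
        elif run is not None and i == run[1] + 1:
            run[1] = i
        else:
            flush()
            run = [i, i]
    flush()
    return bytes(out)


def decode_items(items: bytes, y_subblocks) -> bytes:
    """E2 side: rebuild X from the items and Y's own subblock list."""
    x, pos = bytearray(), 0
    while pos < len(items):
        kind = items[pos]
        if kind == 1:
            (i,) = struct.unpack_from("<I", items, pos + 1)
            x += y_subblocks[i]
            pos += 5
        elif kind == 2:
            first, last = struct.unpack_from("<II", items, pos + 1)
            x += b"".join(y_subblocks[first:last + 1])
            pos += 9
        elif kind == 3:
            (n,) = struct.unpack_from("<I", items, pos + 1)
            x += items[pos + 5:pos + 5 + n]
            pos += 5 + n
        else:
            raise ValueError("unknown item kind")
    return bytes(x)
```

In this sketch a single index costs 5 bytes and a range 9 bytes regardless of how many subblocks it covers, against 16 bytes for each MD5 digest.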

The existing implementation could be further optimized by (without limitation): 

• Adding an additional kind of item that refers to subblocks in X already transmitted.

• Adding an additional kind of item that refers to ranges of subblocks in X
already transmitted.

• Employing data compression techniques to compress the raw blocks in the
third kind of item.

• Using the first hash in the shadow file to check to see if the entire file has
changed at all before performing the backup process described above.

• Replacing hashes in S of subblocks in Y by references to other hashes in S
(where the hashes (and hence subblocks) are identical). Repeated runs of
hashes could also be replaced by pointers to ranges of hashes.

The scheme described above has been described in terms of a single file. However, 
the technique could be applied repeatedly to each of the files in a file system, thus
providing a way to back up an entire file system. The shadow information for each 
file in the file system could be stored inside a separate shadow file for each file, or 
in a master shadow file containing the shadows for one or more (or all) files in the 
file system.

Although most redundancy in a file system is likely to be found within different
versions of each file, there may be great similarities between versions of different
files. For example, if a file is renamed, the "new" file will be identical to the "old"
file. Such redundancy can be catered for by comparing the hashes of all the files in
the old and new versions of a file system. In addition, similarities between different
parts of different files can be exploited by comparing the hashes of subblocks of each 
file to be backed up against the hashes of the subblocks of the entire old version of 
the file system. 

If E2 has lots of space, a further improvement could be for E1 to retain the shadows
of all the previous versions of the file system, and for E2 to retain copies of all the
previous versions of the file system. E1 could then refer to every block it has ever
seen. This technique could also be applied on a file-by-file basis.

In a further variant, the dependence on the ordering of subblocks could be abandoned
and E1 could simply keep a shadow file containing a list of the hashes of all the
subblocks in the previous version (or versions) of the file or file system. E2 would
then need to record only a single copy of each unique subblock it has ever received
from E1.

Aspects of the backup application described in this section can be integrated cleanly 
into existing backup architectures by deploying the new mechanisms within the
framework of the existing ones. For example, the traditional methods for determining
if a file has changed since the last backup (modification date, backup date and
so on) can be used to see if a file needs to be backed up at all, before applying the
new mechanisms. 

Example Application: A Low-Redundancy File System 

We now present an example of a low-redundancy file system that attempts to avoid
storing different instances of the same data more than once. In this example, the
file system is organized as shown in Figure 26. 

The bottom layer consists of a collection of unique subblocks of varying length 
that are stored somewhere on the disk. The middle layer consists of a hash table 
containing one entry for each subblock. Each entry consists of a cryptographic hash 
of the subblock, a reference count for the subblock, and a pointer to the subblock 
on disk. The hash table is indexed by some part of the cryptographic hash (e.g. the 
bottom 16 bits). Although a hash table is used in this example, many other data 
structures (e.g. a binary tree) could also be used to map cryptographic hashes to 
subblock entries. It would also be possible to index the subblocks directly without
the use of cryptographic hashes. 

The top layer consists of a table of files that binds filenames to lists of subblocks, each 
list being a list of indexes into the hash table. Each hash table entry corresponds to
a single unique subblock (except possibly in the case of overflow) and contains the
cryptographic hash of the subblock along with a reference count and a pointer to 
the subblock on the disk. The reference count records the number of references to 
the subblock that appear in the entire set of files in the file table. The issue of hash
table "overflow" can be addressed using a variety of well-known overflow techniques
such as that of attaching a linked list to each hash slot. 

When a file is read, the list of hash table indexes is converted to pointers to subblocks
of data using the hash table. If random access to the file is required, extra
information about the length of the subblocks could be added to the file table and/or
hash table so as to speed access.

Writing a file is more complicated. During a sequential write, the data being written
is buffered until a subblock boundary is reached (as determined by whatever boundary
function is being used). The cryptographic hash of the new subblock is then
calculated and used to look up the hash table. If the subblock is unique (i.e. there
is no entry for the cryptographic hash), it is added to the data blocks on the disk
and an entry is added to the hash table. A new subblock number is added to the 
list of blocks in the file table. If, on the other hand, the subblock already exists, the 
subblock need not be written to disk. Instead, the reference count of the already- 
existing subblock is incremented, and the subblock's hash table index is added to 
the list of blocks in the file's entry in the file table. 
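
The three layers and the write path can be sketched in memory as follows. This is a simplified illustration: a Python dictionary keyed by the full MD5 digest stands in for the hash table (so the file table holds digests rather than hash table indexes), subblocks are kept in memory rather than on disk, and the caller is assumed to have partitioned the data already. The class and method names are hypothetical.

```python
import hashlib


class LowRedundancyStore:
    def __init__(self):
        self.entries = {}  # digest -> [reference count, subblock]
        self.files = {}    # filename -> list of digests (the file table)

    def write(self, name: str, subblocks: list[bytes]) -> None:
        """Store a file, writing each unique subblock at most once."""
        self.delete(name)
        keys = []
        for sub in subblocks:
            key = hashlib.md5(sub).digest()
            entry = self.entries.setdefault(key, [0, sub])
            entry[0] += 1  # one more reference from the file table
            keys.append(key)
        self.files[name] = keys

    def read(self, name: str) -> bytes:
        return b"".join(self.entries[k][1] for k in self.files[name])

    def delete(self, name: str) -> None:
        """Drop a file and release subblocks that nothing else refers to."""
        for key in self.files.pop(name, []):
            entry = self.entries[key]
            entry[0] -= 1
            if entry[0] == 0:
                del self.entries[key]
```

Here a subblock is discarded as soon as its reference count reaches zero; a real system could instead keep it in a pool of unused subblocks until space is needed.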

Random access writes are more involved, but essentially the same principles apply. 

If a record were kept of subblocks created since the last backup, backing up this file 
system could be very efficient indeed. 

One enhancement that could be made is to exploit unused disk space. Instead of 
automatically ignoring or overwriting subblocks whose reference count has dropped 
to zero, the low-redundancy file system could move them to a pool of unused sub- 
blocks. These subblocks, while not present in any file, could still form part of the 
subblock pool referred to when checking to see if incoming subblocks are already
present in the file system. The space consumed by subblocks in the unused sub- 
block pool would be recycled only when the disk was full. In the steady state, the
"unused" portion of the disk would be filled by subblocks in the unused subblock
pool. 

Although this section has specifically described a low-redundancy file system, this 
aspect of the invention is really a general purpose storage system that could be
applied at many levels and in many roles in information processing systems. For
example:

• The technique could be used to implement a low-redundancy virtual mem- 
ory system. The contents of memory could be organized as a collection of 
subblocks.

• The technique could be used to increase the efficiency of an on-chip cache.

Example Application: A Commuuication System 

We now present a method for reducing duplicate transmissions in communications 
systems. Consider two entities El and £*2, where El must transfer a block of data 
A' to E2. El and E2 need never have communicated previously with each other. 

The conventional way to perform the transmission is simply for E1 to transmit X to
E2. Here, however, E1 first partitions X into subblocks and calculates the hash of
each subblock using a hash function. It then transmits the hashes to E2. E2 then
looks up the hashes in a table of hashes of all the subblocks it already possesses.
E2 then transmits to E1 information (e.g. a list of subblock numbers) identifying
the subblocks in X that E2 does not already possess. E1 then transmits just those
subblocks.
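
The exchange just described can be sketched as follows. This is a minimal illustration under stated assumptions: fixed-size partitioning stands in for whichever partitioning method is used, SHA-256 stands in for the hash function, and the names partition, digest, Receiver and send are hypothetical.

```python
import hashlib


def partition(block, size=4):
    # Placeholder partitioning: fixed-size subblocks (an assumption for brevity).
    return [block[i:i + size] for i in range(0, len(block), size)]


def digest(subblock):
    return hashlib.sha256(subblock).digest()


class Receiver:
    """Plays the role of E2."""

    def __init__(self, known_subblocks=()):
        self.table = {digest(s): s for s in known_subblocks}

    def missing(self, hashes):
        # Reply with the positions of the subblocks E2 does not possess.
        return [i for i, h in enumerate(hashes) if h not in self.table]

    def assemble(self, hashes, supplied):
        # supplied maps position -> subblock contents sent by E1.
        for s in supplied.values():
            self.table[digest(s)] = s
        return b"".join(self.table[h] for h in hashes)


def send(block, receiver):
    """Plays the role of E1; returns (reconstructed block, payload bytes sent)."""
    subblocks = partition(block)
    hashes = [digest(s) for s in subblocks]
    wanted = receiver.missing(hashes)
    supplied = {i: subblocks[i] for i in wanted}
    payload = sum(len(s) for s in supplied.values())
    return receiver.assemble(hashes, supplied), payload
```

Only the subblocks E2 reports as missing are counted in the payload; the hashes themselves are an additional, smaller cost that the sketch does not measure.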

Another way to perform the transaction would be for E2 to first transmit to E1 the
hashes of all the subblocks it possesses (or perhaps a well chosen subset of them).
E1 could then transmit references to subblocks in X already known to E2 and the
actual contents of subblocks in X not known to E2. This scheme could be more
efficient than the earlier scheme in cases where E2 possesses fewer subblocks than
there are in X.

Another way to perform the transaction is for E1 and E2 to conduct a more
complicated conversation to establish which subblocks E2 possesses. For example, E2
could send E1 the hashes of just some of the subblocks it possesses (perhaps the
most popular ones). E1 could then send to E2 the hashes of other subblocks in X.
E2 could then reply indicating which of those subblocks it truly does not possess.
E1 could then send to E2 the subblocks in X not possessed by E2.

In a more sophisticated system, E1 and E2 could keep track of the hashes of the
subblocks possessed by the other. If either entity ever sent (for whatever reason)
a reference to a subblock not possessed by the other entity, the latter entity could
simply send back a request for the subblock to be transmitted explicitly and the
former entity could send the requested subblock.
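
One way to picture this bookkeeping is the sketch below, in which each side sends a bare hash whenever it believes the other side holds the subblock, and the receiver asks for the contents if that belief turns out to be wrong. The class Peer, its methods, and the message format of ("ref", hash) and ("data", bytes) pairs are assumptions made for illustration only.

```python
import hashlib


def digest(s):
    return hashlib.sha256(s).digest()


class Peer:
    def __init__(self, subblocks=()):
        self.store = {digest(s): s for s in subblocks}
        self.peer_has = set()   # hashes we believe the other side holds

    def encode(self, subblocks):
        # Send a bare hash when we believe the peer has the subblock,
        # otherwise send the contents.
        msg = []
        for s in subblocks:
            h = digest(s)
            self.store[h] = s
            msg.append(("ref", h) if h in self.peer_has else ("data", s))
            self.peer_has.add(h)
        return msg

    def decode(self, msg, sender):
        out = []
        for kind, value in msg:
            if kind == "ref" and value not in self.store:
                # The sender's belief was wrong: request the subblock explicitly.
                value = sender.fetch(value)
                kind = "data"
            if kind == "data":
                self.store[digest(value)] = value
                out.append(value)
            else:
                out.append(self.store[value])
        return b"".join(out)

    def fetch(self, h):
        return self.store[h]
```

The explicit request makes a stale belief cost one extra round trip rather than a failed transfer.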

The communication application described above considers the case of just two
communicants. However, there is no reason why the scheme could not be generalized to
cover more than two communicants communicating with each other in private and in
public (using broadcasts). For example, to broadcast a block, a computer C1 could
broadcast a list of the hashes of the block's subblocks. Computers C2 ... CN could
then each reply indicating which subblocks they do not already possess. C1 could
then broadcast subblocks that many of the other computers do not possess, and
send the subblocks missing from only a few computers to those computers privately.
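
The choice between broadcasting a subblock and sending it privately can be sketched as a simple count of how many computers reported it missing. The function name plan_delivery and the broadcast_threshold parameter are hypothetical; the text says only "many" and "a few", so the cut-off value here is an assumption.

```python
def plan_delivery(missing_by_peer, broadcast_threshold=3):
    """missing_by_peer maps a peer name to the set of subblock hashes it lacks.
    Returns (hashes to broadcast, {peer: hashes to send privately})."""
    wanted_by = {}
    for peer, hashes in missing_by_peer.items():
        for h in hashes:
            wanted_by.setdefault(h, set()).add(peer)

    # Subblocks missing from many peers are broadcast once.
    broadcast = {h for h, peers in wanted_by.items()
                 if len(peers) >= broadcast_threshold}

    # Subblocks missing from only a few peers are sent to each privately.
    private = {}
    for h, peers in wanted_by.items():
        if h not in broadcast:
            for peer in peers:
                private.setdefault(peer, set()).add(h)
    return broadcast, private
```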

All these techniques have the potential to greatly reduce the amount of information 
transmitted between computers. 

These techniques would be very efficient if they were implemented on top of the file
system described earlier, as the file system would already have performed the work
of organizing all the data it possesses into indexed subblocks. The potential savings
in communication that could be made if many different computer systems shared
the same subblock partitioning algorithm suggest that some form of universal
standardization on a particular partitioning method would be a worthy goal.

Example Application: A Subblock Server 

Aspects of the invention could be used to establish a subblock server on a network
so as to reduce network traffic. A subblock server could be located in a busy part
of a network. It would consist of a computer that breaks each block of data it
sees into subblocks, hashes the subblocks, and then stores them for future reference.
Other computers on the network could send requests to the server for subblocks, the
requests consisting of the hashes of subblocks the server might possess. The server
would respond to each hash, returning either the subblock corresponding to the hash,
or a message stating that the server does not possess a subblock corresponding to
the hash.
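
A minimal sketch of such a server's request interface follows. The class SubblockServer, the fixed-size partitioning, SHA-256, and the use of None as the "not held" reply are assumptions chosen for brevity, not details taken from the text.

```python
import hashlib


class SubblockServer:
    def __init__(self, subblock_size=8):
        self.subblock_size = subblock_size   # fixed-size partitioning, assumed for brevity
        self.store = {}                      # hash -> subblock

    def observe(self, block):
        # Break a passing block into subblocks, hash them and archive them.
        for i in range(0, len(block), self.subblock_size):
            s = block[i:i + self.subblock_size]
            self.store[hashlib.sha256(s).digest()] = s

    def request(self, hashes):
        # One reply per hash: the subblock itself, or None for "not held".
        return [self.store.get(h) for h in hashes]
```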

Such a subblock server could be useful for localizing network traffic on the Internet.
For example, if a subnetwork (even a large one for, say, an entire country) placed a
subblock server on each of its major Internet connections, then (with the appropriate
modification of various protocols) much of the traffic into the network could be
eliminated. For example, if a user requested a file from a remote host on another
network, the user's computer might issue the request and receive, in reply, not the
file, but the hashes of the file's subblocks. The user's computer could then send the
hashes to the local subblock server to see if the subblocks are present there. It would
receive the subblocks that are present and then forward a request for the remaining
subblocks to the remote host. The subblock server might notice the new subblocks
flowing through it and archive them for future reference. The entire effect would
be to eliminate most repeated data transfers between the subnetwork and the rest
of the Internet. However, the security implications of schemes such as these would
need to be closely investigated before they were deployed.
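
The file request scenario above can be sketched from the point of view of the user's computer. In this illustration the local subblock server and the remote host are plain dictionaries mapping hashes to subblocks; a real deployment would replace them with network services, and the function name fetch_file is hypothetical.

```python
import hashlib


def digest(s):
    return hashlib.sha256(s).digest()


def fetch_file(file_hashes, local_server, remote_host):
    """Returns (file contents, bytes pulled from the remote host).
    local_server and remote_host are dicts (hash -> subblock) standing in
    for network services."""
    parts, remote_bytes = [], 0
    for h in file_hashes:
        s = local_server.get(h)
        if s is None:
            s = remote_host[h]        # only missing subblocks cross the link
            remote_bytes += len(s)
            local_server[h] = s       # the server archives what flows through it
        parts.append(s)
    return b"".join(parts), remote_bytes
```

A second request for the same file is then served entirely from the local server, which is the repeated transfer the scheme is meant to remove.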

A further step could be to create "virtual" subblock servers that store the hashes
of subblocks and their location on the Internet rather than the subblocks and their
hashes.

CLAIMS

Note: Claims 1 to 11 will be referred to as the partitioning claims. 

1. A method for partitioning a block b into one or more subblocks, the method using
the component:

(i) a deterministic or non-deterministic function F that returns one of at least two
values, and whose arguments include at least a block of A bits and a block of B
bits, where A and B are natural numbers;

and comprising the step of:

a. Basing the positions of subblock boundaries on the positions k in the block for
which F(b_{k-A} ... b_k, b_{k+1} ... b_{k+B}) falls within a predetermined subclass of the set of
possible function result values.

2. A method according to claim 1, for locating the nearest subblock boundary on a
particular side of a particular position p within a block, but replacing step (a) with:

a. Evaluating F(b_{p-A} ... b_p, b_{p+1} ... b_{p+B}) for increasing (or decreasing) p until the
result of F falls within a predetermined subclass of the set of possible function result
values, the position of the resultant boundary being based on this position.

3. A method according to any claim above, for partitioning a block into one or
more subblocks, wherein boundaries may be added and removed in accordance with
a further method.

4. A method according to any claim above, for partitioning a block into one or more
subblocks, wherein an upperbound U on the subblock size is imposed.

5. A method according to any claim above, for partitioning a block into one or more
subblocks, wherein a lowerbound L on the subblock size is imposed.

6. A method according to any claim above, for partitioning a block into one or
more subblocks, wherein an upperbound U on the subblock size is imposed and a
lowerbound L on the subblock size is also imposed.

7. A method according to one or more of the claims above, wherein more than one
partitioning function (e.g. F1, F2, ...) and method are applied independently to the
block b so as to form more than one group of subblocks.

8. A method for partitioning a block into one or more subblocks by dividing the 
block into subblocks of equal size. 

9. A method for partitioning a block into one or more subblocks by dividing the 
block into subblocks of a small number of different sizes. 

10. A method according to one of the claims above, wherein additional subblocks
are formed from one or more groups of subblocks.

11. A method according to one of the claims above, wherein an additional hierarchy 
of subblocks is formed from one or more contiguous groups of subblocks. 

12. A method in accordance with any partitioning claim, for partitioning a block 
into subblocks and forming a corresponding collection of hashes, comprising the 
steps of: 

a. Partitioning the block into one or more subblocks in accordance with any parti- 
tioning claim; 

b. Calculating the hash of one or more subblocks using a hash function H.

13. A method in accordance with any partitioning claim, for constructing a
projection of a block, comprising the steps of:

a. Partitioning the block into one or more subblocks in accordance with any
partitioning claim;

b. Forming a projection which is an ordered or unordered list containing identities
(e.g. subblocks or hashes of subblocks) of, or references to, one or more of the
subblocks.

14. A method, in accordance with any claim above, for finding identical portions 
within a group of one or more blocks comprising the steps of: 

a. Partitioning one or more of said blocks into one or more subblocks in accordance 
with any claim above; 

b. Comparing the subblocks or the identities (e.g. hashes) of the subblocks. 

15. A method in accordance with any partitioning claim, for representing one or 
more blocks, involving the following components: 

(i) A method for storing and retrieving subblocks; 

(ii) A mapping from block representatives (e.g. filenames) to lists of entries that 
identify subblocks; 

whereby the modification of data in a stored block involves the following steps: 

a. Partitioning the new data into subblocks in accordance with any partitioning
claim;

b. Adding subblocks in the new data that are not already in the collection of
stored subblocks to the collection of stored subblocks, and updating the subblock
list associated with the block being modified.

16. A method for an entity E1 to communicate a group X of one or more subblocks
X_1 ... X_n to E2 where E1 possesses the knowledge that E2 possesses a group Y of
zero or more subblocks Y_1 ... Y_m, comprising the following step:

a. Transmitting from E1 to E2 the contents of a subset of zero or more subblocks in
X, and the remaining subblocks as references which may take (but are not limited
to) the following forms:

(i) a hash of a subblock;

(ii) a reference to a subblock in Y;

(iii) a reference to a range of subblocks in Y;

(iv) a reference to a subblock already transmitted;

(v) a reference to a range of subblocks already transmitted.

17. A method in accordance with claim 16 and any partitioning claim, for an entity
E1 to communicate a block X to E2 where E1 possesses the knowledge that E2
possesses a group Y of subblocks Y_1 ... Y_m, comprising step (a) of claim 16 preceded
by the step:

s. Partitioning X into subblocks X_1 ... X_n in accordance with any partitioning
claim.

18. A method in accordance with claim 16 and any partitioning claim, for an entity
E1 to communicate one or more subblocks of a group X of subblocks X_1 ... X_n to
E2 where E1 possesses the knowledge that E2 possesses the block Y, comprising
step (a) of claim 16 preceded by the step:

s. Partitioning Y into subblocks Y_1 ... Y_m in accordance with any partitioning claim.

19. A method in accordance with claim 16 and any partitioning claim, for an entity
E1 to communicate a block X to E2 where E1 possesses the knowledge that E2
possesses block Y, comprising step (a) of claim 16 preceded by the steps:

s1. Partitioning X into subblocks X_1 ... X_n in accordance with any partitioning
claim.

s2. Partitioning Y into subblocks Y_1 ... Y_m in accordance with any partitioning
claim.

20. A method for constructing a block D from a group X of one or more subblocks
X_1 ... X_n and a group Y of zero or more subblocks Y_1 ... Y_m such that X can be
constructed from Y and D, comprising the step:

a. Constructing D from at least one of the following components:

(1) the contents of one or more subblocks in X;

(2) references to subblocks in Y or to subblocks included in D, or to a range of
subblocks from either D or Y.

21. A method in accordance with claim 20 and any partitioning claim, constructing
a block D from a block X and a group Y of subblocks Y_1 ... Y_m such that X can be
constructed from Y and D, comprising step (a) of claim 20 preceded by the step:

s. Partitioning X into subblocks X_1 ... X_n in accordance with any partitioning
claim.

22. A method in accordance with claim 20 and any partitioning claim, constructing
a block D from a group X of subblocks X_1 ... X_n and a block Y such that X can
be constructed from Y and D, comprising step (a) of claim 20 preceded by the step:

s. Partitioning Y into subblocks Y_1 ... Y_m in accordance with any partitioning claim.

23. A method in accordance with claim 20 and any partitioning claim, constructing
a block D from a block X and a block Y such that X can be constructed from Y
and D, comprising step (a) of claim 20 preceded by the steps:

s1. Partitioning X into subblocks X_1 ... X_n in accordance with any partitioning
claim.

s2. Partitioning Y into subblocks Y_1 ... Y_m in accordance with any partitioning
claim.

24. A method for constructing a block D from a group X of subblocks X_1 ... X_n
and a projection of a block Y (or a projection of a group Y of subblocks Y_1 ... Y_m),
such that X can be constructed from Y and D, comprising the step:

a. Constructing D from at least one of the following components:

(1) the contents of one or more subblocks in X;

(2) references to subblocks in Y or to subblocks included in D, or to a range of
subblocks from either D or Y.

25. A method in accordance with claim 24 and any partitioning claim, constructing
a block D from a block X and a projection of Y such that X can be constructed
from Y and D, comprising the step of claim 24 with the following step inserted
before step (a):

s. Partitioning X into subblocks X_1 ... X_n in accordance with any partitioning
claim.

26. A method for constructing a block X (or group X of subblocks X_1 ... X_n) from
a group Y of subblocks Y_1 ... Y_m and a block D, comprising the step of:

a. Constructing X from D and Y by constructing the subblocks of X based on one
or more of:

(i) references in D to subblocks in Y;

(ii) references in D to subblocks in D;

(iii) references in D that specify a range of subblocks in Y;

(iv) references in D that specify a range of subblocks in D;

(v) subblocks contained within D;

(vi) other data elements in D.

27. A method in accordance with claim 26 and any partitioning claim, for
constructing a block X (or group X of subblocks X_1 ... X_n) from a block Y and a block D,
comprising the step of claim 26 with the following step inserted before step (a):

s. Partitioning Y into subblocks Y_1 ... Y_m in accordance with any partitioning claim.

28. A method for transmitting a group X of subblocks X_1 ... X_n from one entity
E1 to another entity E2, comprising the steps of:

a. Transmitting from E1 to E2 an identity of one or more subblocks;

b. Transmitting from E2 to E1 information communicating the presence or absence
of subblocks at E2;

c. E1 transmitting to E2 at least the subblocks identified in step (b) as not being
present at E2.

29. A method in accordance with claim 28 and any partitioning claim, for
communicating a data block X from one entity E1 to another entity E2, comprising the
steps of claim 28 but with the following step inserted before step (a):

s. Partitioning X into subblocks X_1 ... X_n in accordance with any partitioning
claim.

30. A method in accordance with claim 13 for comparing the contents of two or
more blocks comprising the steps:

a. Constructing a projection of each block as described in claim 13; 

b. Comparing the projections of the blocks. 

31. A method for transmitting a group X of subblocks X_1 ... X_n from one entity
E1 to another entity E2, comprising the steps of:

a. Transmitting from E2 to E1 information communicating the presence or absence
at E2 of members of a group Y of subblocks Y_1 ... Y_m;

b. Transmitting from E1 to E2 the contents of zero or more subblocks in X, and
the remaining subblocks as references which may take (but are not limited to) the
following forms:

(i) a hash of a subblock;

(ii) a reference to a subblock in Y;

(iii) a reference to a range of subblocks in Y;

(iv) a reference to a subblock already transmitted;

(v) a reference to a range of subblocks already transmitted.

32. A method in accordance with claim 31 and any partitioning claim, for
transmitting a block X from one entity E1 to another entity E2, comprising the steps of
claim 31 with the following step inserted before step (a):

s. E1 partitioning X into subblocks X_1 ... X_n in accordance with any partitioning
claim.

33. A method for an entity E2 to communicate to an entity E1 the fact that E2
possesses a group Y of subblocks Y_1 ... Y_m, comprising the step of:

a. E2 transmitting to E1 identities or references of the subblocks Y_1 ... Y_m.

34. A method in accordance with claim 33 and any partitioning claim, for an entity
E2 to communicate to an entity E1 the fact that E2 possesses a block Y, comprising
the step of claim 33 with the following step inserted before step (a):

s. E2 partitioning Y into subblocks Y_1 ... Y_m in accordance with any partitioning
claim.

35. A method for an entity E1 to communicate a subblock X_i to an entity E2,
comprising the steps:

a. E2 sending E1 an identity of X_i.

b. E1 sending X_i to E2.

36. A method in accordance with claim 35 and any partitioning claim, for an entity
E1 to communicate a subblock X_i to an entity E2, comprising the steps of claim 35
with the following step inserted before step (a):

s. E1 partitioning a block X into subblocks X_1 ... X_n in accordance with any
partitioning claim.

37. A method in accordance with any claim above, wherein one or more of the
comparisons of subblocks are performed by comparing the hashes of the subblocks,
using hashes already available (e.g. as a byproduct of other steps), or calculated for
the purpose of performing one or more said comparisons.

38. A method in accordance with any claim above, wherein subsets of identical
subblocks within a group of one or more subblocks are identified by inserting each
subblock, an identity of each subblock, a reference of each subblock, or a hash of
each subblock, into a data structure.

39. A method in accordance with any claim above, wherein various actions are 
executed concurrently. 

40. An apparatus for partitioning a block b into one or more subblocks, the apparatus
comprising:

(i) means for evaluating a deterministic or non-deterministic function F that returns
one of at least two values, and whose arguments include at least a block of A bits
and a block of B bits, where A and B are natural numbers;

and comprising the step of

a. Generating a set of partitions of b, basing these upon the positions of subblock
boundaries on the positions k in the block for which F(b_{k-A} ... b_k, b_{k+1} ... b_{k+B}) falls
within a predetermined subclass of the set of possible function result values.

41. An apparatus, in accordance with claim 40, for locating the nearest subblock
boundary on a particular side of a particular position p within a block b, wherein
step (a) is replaced with

a. Generating a position within b by evaluating F(b_{p-A} ... b_p, b_{p+1} ... b_{p+B}) for
increasing (or decreasing) p until the result of F falls within a predetermined subclass
of the set of possible function result values, the position being based on this position.

42. An apparatus for partitioning a block into one or more subblocks comprising

(i) means for dividing a block into subblocks of equal size.

43. An apparatus for partitioning a block into one or more subblocks comprising

(i) means for dividing a block into subblocks of a small number of different sizes.

44. An apparatus E1 that can communicate a group X of one or more subblocks
X_1 ... X_n to an entity E2 where E1 possesses the knowledge that E2 possesses a
group Y of zero or more subblocks Y_1 ... Y_m, the apparatus comprising

(i) means for manipulating subblocks, subblock identities, and subblock references;
and comprising the step

a. Transmitting from E1 to E2 the contents of a subset of zero or more subblocks in
X, and the remaining subblocks as references which may take (but are not limited
to) the following forms:

(i) a hash of a subblock;

(ii) a reference to a subblock in Y;

(iii) a reference to a range of subblocks in Y;

(iv) a reference to a subblock already transmitted;

(v) a reference to a range of subblocks already transmitted.

45. An apparatus for constructing a block D from a group X of one or more
subblocks X_1 ... X_n and a group Y of zero or more subblocks Y_1 ... Y_m such that X
can be constructed from Y and D, the apparatus comprising:

(i) means for manipulating subblocks, subblock identities and subblock references;
and comprising the step

a. Constructing D from at least one of the following components:

(1) the contents of one or more subblocks in X;

(2) references to subblocks in Y or to subblocks included in D, or to a range of
subblocks from either D or Y.

46. An apparatus for constructing a block D from a group X of subblocks X_1 ... X_n
and a projection of a block Y (or a projection of a group Y of subblocks Y_1 ... Y_m),
such that X can be constructed from Y and D, the apparatus comprising

(i) means for manipulating subblocks, subblock identities and subblock references;

and comprising the step:

a. Constructing D from at least one of the following components:

(1) the contents of one or more subblocks in X;

(2) references to subblocks in Y or to subblocks included in D, or to a range of
subblocks from either D or Y.

47. An apparatus for constructing a block X (or group X of subblocks X_1 ... X_n)
from a group Y of subblocks Y_1 ... Y_m and a block D, the apparatus comprising

(i) means for manipulating subblocks, subblock identities and subblock references;

and comprising the step of

a. Constructing X from D and Y by constructing the subblocks of X based on one
or more of:

(i) references in D to subblocks in Y;

(ii) references in D to subblocks in D;

(iii) references in D that specify a range of subblocks in Y;

(iv) references in D that specify a range of subblocks in D;

(v) subblocks contained within D;

(vi) other data elements in D.

48. A system for transmitting a group X of subblocks X_1 ... X_n from one apparatus
E1 to another apparatus E2, each apparatus consisting of

(i) means for manipulating subblocks, subblock identities and subblock references;
and the system's execution comprising the steps of:

a. Transmitting from E1 to E2 an identity of one or more subblocks;

b. Transmitting from E2 to E1 information communicating the presence or absence
of subblocks at E2;

c. E1 transmitting to E2 at least the subblocks identified in step (b) as not being
present at E2.

49. A system for transmitting a group X of subblocks X_1 ... X_n from one apparatus
E1 to another apparatus E2, each apparatus consisting of

(i) means for manipulating subblocks, subblock identities and subblock references;
and comprising the steps of:

a. Transmitting from E2 to E1 information communicating the presence or absence
at E2 of members of a group Y of subblocks Y_1 ... Y_m;

b. Transmitting from E1 to E2 the contents of zero or more subblocks in X, and
the remaining subblocks as references which may take (but are not limited to) the
following forms:

(i) a hash of a subblock;

(ii) a reference to a subblock in Y;

(iii) a reference to a range of subblocks in Y;

(iv) a reference to a subblock already transmitted;

(v) a reference to a range of subblocks already transmitted.

50. A system for an apparatus E2 to communicate to an apparatus E1 the fact that
E2 possesses a group Y of subblocks Y_1 ... Y_m, each apparatus consisting of

(i) means for manipulating subblocks, subblock identities and subblock references;

and comprising the step of:

a. E2 transmitting to E1 identities or references of the subblocks Y_1 ... Y_m.

51. A system for an apparatus E1 to communicate a subblock X_i to an apparatus
E2, each apparatus consisting of

(i) means for manipulating subblocks, subblock identities and subblock references;
and comprising the steps:

a. E2 sending E1 an identity of X_i.

b. E1 sending X_i to E2.

AMENDED CLAIMS

[received by the International Bureau on 24 July 1996 (24.07.96);
original claims 1-51 replaced by amended claims 1-31 (10 pages)]

Note: Claims 1 to 9 will be referred to as the partitioning claims.

1. A method for partitioning a block b into one or more subblocks, the method using
the component:

(i) a deterministic or non-deterministic function F that returns one of at least two
values, and whose arguments include at least a block of A bits and a block of B
bits, where A and B are natural numbers;

and comprising the step of:

a. Basing the positions of subblock boundaries on the positions k in the block for
which F(b_{k-A} ... b_k, b_{k+1} ... b_{k+B}) falls within a predetermined subclass of the set of
possible function result values.

2. A method according to claim 1, for locating the nearest subblock boundary on a
particular side of a particular position p within a block, but replacing step (a) with:

a. Evaluating F(b_{p-A} ... b_p, b_{p+1} ... b_{p+B}) for increasing (or decreasing) p until the
result of F falls within a predetermined subclass of the set of possible function result
values, the position of the resultant boundary being based on this position.

3. A method according to any claim above, for partitioning a block into one or
more subblocks, wherein boundaries may be added and removed in accordance with
a further method.

4. A method according to any claim above, for partitioning a block into one or more
subblocks, wherein an upperbound U on the subblock size is imposed.

5. A method according to any claim above, for partitioning a block into one or more
subblocks, wherein a lowerbound L on the subblock size is imposed.

6. A method according to any claim above, for partitioning a block into one or
more subblocks, wherein an upperbound U on the subblock size is imposed and a
lowerbound L on the subblock size is also imposed.

7. A method according to one or more of the claims above, wherein more than one
partitioning function (e.g. F1, F2, ...) and method are applied independently to the
block b so as to form more than one group of subblocks.

8. A method according to one of the claims above, wherein additional subblocks are
formed from one or more groups of subblocks.

9. A method according to one of the claims above, wherein an additional hierarchy
of subblocks is formed from one or more contiguous groups of subblocks.

10. A method in accordance with any partitioning claim, for partitioning a block 
into subblocks and forming a corresponding collection of hashes, comprising the 
steps of: 

a. Partitioning the block into one or more subblocks in accordance with any parti- 
tioning claim; 

b. Calculating the hash of one or more subblocks using a hash function H. 

11. A method in accordance with any partitioning claim, for constructing a
projection of a block, comprising the steps of:

a. Partitioning the block into one or more subblocks in accordance with any
partitioning claim;

b. Forming a projection which is an ordered or unordered list containing identities
(e.g. subblocks or hashes of subblocks) of, or references to, one or more of the
subblocks.

12. A method, in accordance with any claim above, for finding identical portions
within a group of one or more blocks comprising the steps of:

a. Partitioning one or more of said blocks into one or more subblocks in accordance
with any claim above;

b. Comparing the subblocks or the identities (e.g. hashes) of the subblocks.

13. A method in accordance with any partitioning claim, for representing one or 
more blocks, involving the following components: 

(i) A method for storing and retrieving subblocks;

(ii) A mapping from block representatives (e.g. filenames) to lists of entries that
identify subblocks;

whereby the modification of data in a stored block involves the following steps:

a. Partitioning the new data into subblocks in accordance with any partitioning
claim;

b. Adding subblocks in the new data that are not already in the collection of
stored subblocks to the collection of stored subblocks, and updating the subblock
list associated with the block being modified.

14. A method in accordance with any partitioning claim, for an entity E1 to
communicate a block X to E2 where E1 possesses the knowledge that E2 possesses a
group Y of subblocks Y_1 ... Y_m, comprising the following steps:

a. Partitioning X into subblocks X_1 ... X_n in accordance with any partitioning
claim;

b. Transmitting from E1 to E2 the contents of a subset of zero or more subblocks in
X, and the remaining subblocks as references which may take (but are not limited
to) the following forms:

(i) a hash of a subblock;

(ii) a reference to a subblock in Y;

(iii) a reference to a range of subblocks in Y;

(iv) a reference to a subblock already transmitted;

(v) a reference to a range of subblocks already transmitted.

15. A method in accordance with any partitioning claim, for an entity E1 to
communicate one or more subblocks of a group X of subblocks X_1 ... X_n to E2 where
E1 possesses the knowledge that E2 possesses the block Y, comprising the following
steps:

a. Partitioning Y into subblocks Y_1 ... Y_m in accordance with any partitioning claim;

b. Transmitting from E1 to E2 the contents of a subset of zero or more subblocks in
X, and the remaining subblocks as references which may take (but are not limited
to) the following forms:

(i) a hash of a subblock;

(ii) a reference to a subblock in Y;

(iii) a reference to a range of subblocks in Y;

(iv) a reference to a subblock already transmitted;

(v) a reference to a range of subblocks already transmitted.

16. A method in accordance with any partitioning claim, for an entity E1 to
communicate a block X to E2 where E1 possesses the knowledge that E2 possesses
block Y, comprising the following steps:

a. Partitioning X into subblocks X_1 ... X_n in accordance with any partitioning
claim;

b. Partitioning Y into subblocks Y_1 ... Y_m in accordance with any partitioning claim;

c. Transmitting from E1 to E2 the contents of a subset of zero or more subblocks in
X, and the remaining subblocks as references which may take (but are not limited
to) the following forms:

(i) a hash of a subblock;

(ii) a reference to a subblock in Y;

(iii) a reference to a range of subblocks in Y;

(iv) a reference to a subblock already transmitted;

(v) a reference to a range of subblocks already transmitted.

17. A method in accordance with any partitioning claim, for constructing a block D 
from a block A' and a group V of subblocks ) \ . . ,)m such that A can be constructed 
from V and D, comprising the following steps: 

a. Partitioning A' into subblocks A'j ...A'„ in accordance with any partitioning 
claim; 

b. Constructing D from at least one of the following components: 

(i) the contents of one or more subblocks in A; 

(ii) references to subblocks in V or to subblocks included in D. or to a range of 
subblocks from either D or Y, 

18. A method in accordance with any partitioning claim, for constructing a block D
from a group X of subblocks X_1 ... X_n and a block Y such that X can be constructed
from Y and D, comprising the following steps:

a. Partitioning Y into subblocks Y_1 ... Y_m in accordance with any partitioning claim;

b. Constructing D from at least one of the following components:

(i) the contents of one or more subblocks in X;

(ii) references to subblocks in Y or to subblocks included in D, or to a range of
subblocks from either D or Y.

19. A method in accordance with any partitioning claim, for constructing a block
D from a block X and a block Y such that X can be constructed from Y and D,
comprising the following steps:

a. Partitioning X into subblocks X_1 ... X_n in accordance with any partitioning
claim;

b. Partitioning Y into subblocks Y_1 ... Y_m in accordance with any partitioning claim;

c. Constructing D from at least one of the following components:

(i) the contents of one or more subblocks in X;

(ii) references to subblocks in Y or to subblocks included in D, or to a range of
subblocks from either D or Y.

20. A method in accordance with any partitioning claim, for constructing a block
D from a block X and a projection of Y such that X can be constructed from Y
and D, comprising the following steps:

a. Partitioning X into subblocks X_1 ... X_n in accordance with any partitioning
claim;

b. Constructing D from at least one of the following components:

(i) the contents of one or more subblocks in X;

(ii) references to subblocks in Y or to subblocks included in D, or to a range of
subblocks from either D or Y.

21. A method in accordance with any partitioning claim, for constructing a block
X (or group X of subblocks X_1 ... X_n) from a block Y and a block D, comprising
the following steps:

a. Partitioning Y into subblocks Y_1 ... Y_m in accordance with any partitioning claim;

b. Constructing X from D and Y by constructing the subblocks of X based on one
or more of:

(i) references in D to subblocks in Y;

(ii) references in D to subblocks in D;

(iii) references in D that specify a range of subblocks in Y;

(iv) references in D that specify a range of subblocks in D;

(v) subblocks contained within D;

(vi) other data elements in D.
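
As an illustration of claims 17 to 21 taken together, the sketch below builds a block D from X and Y and then rebuilds X from Y and D. It is a minimal sketch under assumptions: the newline-based `partition` is a stand-in for "any partitioning claim", D is modelled as a Python list, and the entry kinds `raw` and `range` are names introduced here. Only range references into Y are shown; references into D itself are omitted for brevity.

```python
def partition(block: bytes) -> list[bytes]:
    """Toy data-dependent partitioner (assumption): a boundary follows every newline."""
    parts, start = [], 0
    for i, byte in enumerate(block):
        if byte == 0x0A:
            parts.append(block[start:i + 1])
            start = i + 1
    if start < len(block):
        parts.append(block[start:])
    return parts


def make_delta(x: bytes, y: bytes) -> list[tuple]:
    """Construct D: raw subblock contents plus half-open ranges [start, stop) into Y."""
    ys = partition(y)
    # Iterate in reverse so the first occurrence of each subblock in Y wins.
    first_index = {s: j for j, s in reversed(list(enumerate(ys)))}
    delta = []
    for sub in partition(x):
        j = first_index.get(sub)
        if j is None:
            delta.append(("raw", sub))
        elif delta and delta[-1][0] == "range" and delta[-1][2] == j:
            delta[-1] = ("range", delta[-1][1], j + 1)  # extend the previous range
        else:
            delta.append(("range", j, j + 1))
    return delta


def apply_delta(delta: list[tuple], y: bytes) -> bytes:
    """Reconstruct X from D and Y."""
    ys = partition(y)
    out = bytearray()
    for entry in delta:
        if entry[0] == "raw":
            out += entry[1]
        else:
            _, start, stop = entry
            out += b"".join(ys[start:stop])
    return bytes(out)
```

For Y = `b"a\nb\nc\nd\n"` and X = `b"b\nc\nz\na\n"`, D holds one two-subblock range, one raw subblock and one single-subblock range, and `apply_delta` returns X unchanged.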

22. A method in accordance with any partitioning claim, for communicating a data
block X from one entity E1 to another entity E2, comprising the following steps:

a. Partitioning X into subblocks X_1 ... X_n in accordance with any partitioning
claim;

b. Transmitting from E1 to E2 an identity of one or more subblocks;

c. Transmitting from E2 to E1 information communicating the presence or absence
of subblocks at E2;

d. E1 transmitting to E2 at least the subblocks identified in step (c) as not being
present at E2.
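
The four steps of claim 22 can be sketched as a small in-process exchange. Everything concrete here is an assumption made for illustration: the `Receiver` class and its method names are hypothetical, SHA-256 serves as the subblock identity, and the newline-based `partition` stands in for "any partitioning claim".

```python
import hashlib


def partition(block: bytes) -> list[bytes]:
    """Toy data-dependent partitioner (assumption): a boundary follows every newline."""
    parts, start = [], 0
    for i, byte in enumerate(block):
        if byte == 0x0A:
            parts.append(block[start:i + 1])
            start = i + 1
    if start < len(block):
        parts.append(block[start:])
    return parts


def identity(sub: bytes) -> bytes:
    return hashlib.sha256(sub).digest()


class Receiver:
    """Hypothetical model of entity E2: a store of subblocks keyed by identity."""

    def __init__(self) -> None:
        self.store: dict[bytes, bytes] = {}

    def absent(self, identities: list[bytes]) -> set[bytes]:
        return {h for h in identities if h not in self.store}

    def accept(self, subblocks) -> None:
        for sub in subblocks:
            self.store[identity(sub)] = sub

    def assemble(self, identities: list[bytes]) -> bytes:
        return b"".join(self.store[h] for h in identities)


def send(x: bytes, e2: Receiver) -> int:
    """Run steps (a) to (d); return how many subblocks were sent as contents."""
    subs = partition(x)                      # step (a)
    ids = [identity(s) for s in subs]        # step (b): E1 -> E2 identities
    missing = e2.absent(ids)                 # step (c): E2 -> E1 presence/absence
    payload = {h: s for h, s in zip(ids, subs) if h in missing}
    e2.accept(payload.values())              # step (d): E1 -> E2 missing subblocks
    return len(payload)
```

Sending `b"one\ntwo\n"` and then `b"one\ntwo\nthree\n"` to the same receiver transfers two subblocks the first time and only one the second time.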

23. A method in accordance with claim 11 for comparing the contents of two or
more blocks comprising the steps:

a. Constructing a projection of each block as described in claim 11;

b. Comparing the projections of the blocks.
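
A rough sketch of the comparison in claim 23 follows. Claim 11 is not reproduced in this section, so the reading of a "projection" as the ordered list of subblock hashes is an assumption, as are the newline-based `partition` and the choice of SHA-256.

```python
import hashlib


def partition(block: bytes) -> list[bytes]:
    """Toy data-dependent partitioner (assumption): a boundary follows every newline."""
    parts, start = [], 0
    for i, byte in enumerate(block):
        if byte == 0x0A:
            parts.append(block[start:i + 1])
            start = i + 1
    if start < len(block):
        parts.append(block[start:])
    return parts


def projection(block: bytes) -> list[bytes]:
    """Assumed form of a projection: one hash per subblock, in order."""
    return [hashlib.sha256(s).digest() for s in partition(block)]


def shared_subblocks(a: bytes, b: bytes) -> list[int]:
    """Indices of subblocks of `a` whose hash also appears in the projection of `b`."""
    hashes_of_b = set(projection(b))
    return [i for i, h in enumerate(projection(a)) if h in hashes_of_b]
```

Only the projections are compared, so the two blocks themselves never need to be held side by side.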

24. A method in accordance with any partitioning claim, for transmitting a block
X from one entity E1 to another entity E2, comprising the following steps:

a. E1 partitioning X into subblocks X_1 ... X_n in accordance with any partitioning
claim;

b. Transmitting from E2 to E1 information communicating the presence or absence
at E2 of members of a group Y of subblocks Y_1 ... Y_m;

c. Transmitting from E1 to E2 the contents of zero or more subblocks in X, and
the remaining subblocks as references which may take (but are not limited to) the
following forms:

(i) a hash of a subblock;

(ii) a reference to a subblock in Y;

(iii) a reference to a range of subblocks in Y;

(iv) a reference to a subblock already transmitted;

(v) a reference to a range of subblocks already transmitted.

25. A method in accordance with any partitioning claim, for an entity E2 to
communicate to an entity E1 the fact that E2 possesses a block Y, comprising the
following steps:

a. E2 partitioning Y into subblocks Y_1 ... Y_m in accordance with any partitioning
claim;

b. E2 transmitting to E1 identities or references of the subblocks Y_i.

26. A method in accordance with any partitioning claim, for an entity E1 to
communicate a subblock X_i to an entity E2, comprising the following steps:

a. E1 partitioning a block X into subblocks X_1 ... X_n in accordance with any
partitioning claim;

b. E2 sending E1 an identity of X_i;

c. E1 sending X_i to E2.

27. A method in accordance with any claim above, wherein one or more of the 
comparisons of subblocks are performed by comparing the hashes of the subblocks, 
using hashes already available (e.g. as a byproduct of other steps), or calculated for 
the purpose of performing one or more said comparisons. 

28. A method in accordance with any claim above, wherein subsets of identical
subblocks within a group of one or more subblocks are identified by inserting each
subblock, an identity of each subblock, a reference of each subblock, or a hash of
each subblock, into a data structure.
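
One possible reading of claim 28 is sketched below, using a hash table as the data structure. The function name `duplicate_groups` and the choice of SHA-256 are assumptions made for illustration; the claim itself does not name a particular structure or hash.

```python
import hashlib
from collections import defaultdict


def duplicate_groups(subblocks: list[bytes]) -> list[list[int]]:
    """Insert a hash of each subblock into a table; return index groups that collide."""
    table: dict[bytes, list[int]] = defaultdict(list)
    for i, sub in enumerate(subblocks):
        table[hashlib.sha256(sub).digest()].append(i)
    # Any table entry holding more than one index is a set of identical subblocks.
    return [indices for indices in table.values() if len(indices) > 1]
```

For the group `[b"aa", b"bb", b"aa", b"cc", b"bb", b"aa"]` this yields the index sets `[0, 2, 5]` and `[1, 4]`.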

29. A method in accordance with any claim above, wherein various actions are
executed concurrently.

30. An apparatus for partitioning a block b into one or more subblocks, the apparatus
comprising:

(i) means for evaluating a deterministic or non-deterministic function F that returns
one of at least two values, and whose arguments include at least a block of A bits
and a block of B bits, where A and B are natural numbers;

and comprising the step of

a. Generating a set of partitions of b, basing these upon the positions of subblock
boundaries on the positions k in the block for which
F(b_{k-A+1} ... b_k, b_{k+1} ... b_{k+B}) falls
within a predetermined subclass of the set of possible function result values.
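
The boundary rule of claim 30 can be illustrated with a toy function F. The sketch makes several simplifying assumptions: it works on whole bytes rather than bits (so the windows are A bytes and B bytes wide), the particular F (sum of both windows modulo 8) is an arbitrary stand-in, and the "predetermined subclass" of results is taken to be the single value 0.

```python
def F(left: bytes, right: bytes) -> int:
    """Hypothetical boundary function over the window before and after position k."""
    return (sum(left) + sum(right)) % 8


def boundaries(block: bytes, a: int = 2, b: int = 1) -> list[int]:
    """Positions k (a boundary lies between block[k-1] and block[k]) where F hits 0."""
    return [
        k
        for k in range(a, len(block) - b + 1)
        if F(block[k - a:k], block[k:k + b]) == 0
    ]


def partition(block: bytes, a: int = 2, b: int = 1) -> list[bytes]:
    """Cut the block at every boundary position, dropping any empty pieces."""
    cuts = [0] + boundaries(block, a, b) + [len(block)]
    return [block[i:j] for i, j in zip(cuts, cuts[1:]) if i < j]
```

Because F looks only at the bytes around k, inserting data early in a block shifts the later boundaries without changing which subblocks they delimit. For example, `bytes([1, 2, 1, 5, 3, 0, 4, 4])` has boundaries at 3, 5 and 7, and prepending one byte of value 9 moves them to 4, 6 and 8.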

31. An apparatus, in accordance with claim 30, for locating the nearest subblock
boundary on a particular side of a particular position p within a block b, wherein
step (a) is replaced with

a. Generating a position within b by evaluating
F(b_{p-A+1} ... b_p, b_{p+1} ... b_{p+B}) for increasing (or decreasing) p until
the result of F falls within a predetermined subclass
of the set of possible function result values, the position being based on this position.
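
The search in claim 31 can be sketched in the same spirit. As before, this is an illustration under assumptions: byte-wide windows instead of bit-wide ones, an arbitrary toy F, the value 0 as the "predetermined subclass", and a hypothetical `step` argument (+1 or -1) to select the side of p that is searched.

```python
from typing import Optional


def F(left: bytes, right: bytes) -> int:
    """Hypothetical boundary function over the window before and after position p."""
    return (sum(left) + sum(right)) % 8


def nearest_boundary(
    block: bytes, p: int, a: int = 2, b: int = 1, step: int = 1
) -> Optional[int]:
    """Scan from p in the direction of `step` until F hits 0; None if no boundary."""
    lowest, highest = a, len(block) - b  # positions where both windows fit
    # Clamp the start so a search that begins outside the valid span can enter it.
    p = max(p, lowest) if step > 0 else min(p, highest)
    while lowest <= p <= highest:
        if F(block[p - a:p], block[p:p + b]) == 0:
            return p
        p += step
    return None
```

On `bytes([1, 2, 1, 5, 3, 0, 4, 4])`, searching upward from position 4 finds the boundary at 5, and searching downward from 4 finds the boundary at 3.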

STATEMENT UNDER ARTICLE 19



The applicant forwards herewith replacement pages 6$ - 70 for the 
originally filed pages 6% - 74. 

Claims 8 and 9 are deleted. Claim 16 is deleted and incorporated into
claims 17, 18 and 19. Claim 20 is deleted and incorporated into 21, 22 and
23. Claim 24 is deleted and incorporated into claim 25. Claim 26 is
deleted and incorporated into claim 27. Claim 28 is deleted and
incorporated into claim 29. Claim 31 is deleted and incorporated into
claim 32. Claim 33 is deleted and incorporated into claim 34. Claim 35 is
deleted and incorporated into claim 36. Claims 42 to 51 are deleted. All
the affected claims have been appropriately renumbered and their
dependencies amended.

In the International Search Report all of the patents cited by the 
Examiner were considered of particular relevance to the novelty of claims 
8 and 9. These claims have been deleted and other claims in which the 
features of claims 8 and 9 were incorporated by reference have been 
amended so as to delete these features. 

Other claims have been amended to better distinguish the invention from 
the cited prior art. 